This contract has three primary designated research areas: distributed communications access, distributed processing, and distributed control and algorithms.

This report contains the abstracts of the publications which summarize our research results in those areas during this semi-annual period, followed by the main body of the report, which consists of the Ph.D. dissertation by Kenneth Kung, "Concurrency in Parallel Processing Systems", conducted under the supervision of Professor Leonard Kleinrock (Principal Investigator for this contract).

A model for parallel processing is introduced using the graph model of computation. Four key classification parameters are considered: the input (either a fixed number of jobs, k, or a random arrival process with rate λ jobs/sec); the structure of the graph (either a fixed graph structure, G, or a random graph, G*); the service time per task (either a fixed service time, x, or a random service time, x*); and the number of processors (either a finite number, P, or an infinite number, P = ∞).

For the cases (k, G, x*, P = ∞) and (λ, G, x*, P = ∞), we set up a Markov chain for the number of concurrent tasks and solve for the equilibrium probabilities. From these we also find the mean system response time and the speedup factor. Using simpler methods, we place upper and lower bounds on the speedup. For the case (k, G, x, P < ∞), we also bound the speedup for the special case of diamond-shaped process graphs.

For random graphs with fixed service time (k, G*, x, P = ∞), we find that as N, the number of tasks per job, approaches infinity, the speedup simply approaches 2!

For random service times, we find that the speedup, S, is bounded by $\frac{4}{3} \le S \le 2$.

We also present results for the optimal number of processors, P, for diamond-shaped graphs in the case (k, G*, x, P < ∞), where the objective is to maximize "power", defined as throughput divided by response time.

The issue of communication overhead is also addressed in this work, and we study the effect of this overhead on the gains that are achieved from parallel processing.
INTRODUCTION

This Semi-Annual Technical Report covers research carried out by the Advanced Teleprocessing Systems Group at UCLA under DARPA Contract No. MDA 903-82-C-0064 covering the period from October 1, 1983 to March 31, 1984. Under this contract we have three designated tasks, as follows:


TASK I. DISTRIBUTED COMMUNICATIONS ACCESS

The general problem of sharing multi-access broadcast distributed systems among a set of competing users will be studied. General issues involving exhaustive communications, start-up problems and refined models to manifest more realistic phenomena in these systems will be studied. Applications to packet radio systems and large survivable networks involving the study of tandem networks, multi-hop networks, one-way communication links, correct reception of more than one simultaneous transmission and mobility will be included. Further applications will include the study of very high bandwidth channels and/or very long propagation delay systems, multiple token systems and compound hierarchical network structures.


TASK II. DISTRIBUTED PROCESSING

The interplay between distributed communications in a broadcast environment and processing of distributed data will be studied. For example, the effect of merging sorted lists in a broadcast environment, as well as finding properties of elements in these lists, will be studied. Concurrency in multiprocessor systems will be studied in order to investigate performance in terms of response time and speedup factors for various graph models of computation. Connection architectures for multiprocessor systems will be investigated as well. One application here is the structure of the processing and communication architecture for supercomputers.


TASK III. DISTRIBUTED CONTROL AND ALGORITHMS

Routing, flow control and survivability in large packet radio networks as well as in public data networks will be studied as control algorithms in a distributed environment. Measures of performance, including throughput, response time, blocking, power, fairness, and robustness, will be applied to these systems. Distributed algorithms for finding shortest paths, connectivity, loops, etc. will be studied. The effect of node and link failures, limited amounts of memory at each node and restricted channel capacity for communications will be investigated. The effect of network failures and delays on distributed data base management systems will also be studied.


A major contribution of our research during this reporting period is contained in Reference 4 listed below, namely "Concurrency in Parallel Processing Systems", by Kenneth Kung. This dissertation was supervised by Professor Leonard Kleinrock (Principal Investigator for this research).

A model for parallel processing is introduced using the graph model of computation. Four key classification parameters are considered: the input (either a fixed number of jobs, k, or a random arrival process with rate λ jobs/sec); the structure of the graph (either a fixed graph structure, G, or a random graph, G*); the service time per task (either a fixed service time, x, or a random service time, x*); and the number of processors (either a finite number, P, or an infinite number, P = ∞).

For the cases (k, G, x*, P = ∞) and (λ, G, x*, P = ∞), we set up a Markov chain for the number of concurrent tasks and solve for the equilibrium probabilities. From these we also find the mean system response time and the speedup factor. Using simpler methods, we place upper and lower bounds on the speedup. For the case (k, G, x, P < ∞), we also bound the speedup for the special case of diamond-shaped process graphs.

For random graphs with fixed service time (k, G*, x, P = ∞), we find that as N, the number of tasks per job, approaches infinity, the speedup simply approaches 2! For random service times, we find that the speedup, S, is bounded by $\frac{4}{3} \le S \le 2$.

We also present results for the optimal number of processors, P, for diamond-shaped graphs in the case (k, G*, x, P < ∞), where the objective is to maximize "power", defined as throughput divided by response time.

The issue of communication overhead is also addressed in this work, and we study the effect of this overhead on the gains that are achieved from parallel processing. The entire dissertation is reproduced as the main body of this report. The following list of research publications summarizes the results of the semi-annual period, and the abstract of each paper is given along with the reference itself.


RESEARCH PUBLICATIONS


Kleinrock, L. and G. Akavia, "On a Self Adjusting Capability of Random Access Networks", IEEE Transactions on Communications, Vol. COM-32, No. 1, January 1984, pp. 40-47.

We consider a distributed communication network with many terminals which are distributed in space and wish to communicate with each other using a common radio channel. Choosing the transmission range in such a network involves the following tradeoff: a long range enables messages to reach their destinations in a few hops, but increases the amount of traffic competing for the channel at every point.

We give a simple model for the per-hop delay in random access networks, analyze this tradeoff, and give the optimal transmission range.

When this optimal range is chosen as a function of specified traffic and delay parameters, networks demonstrate an important self-adjusting capability. This capability to adjust to traffic makes heavily loaded networks far better than centralized systems (in which all messages must reach one common destination).

Dividing a terminal population into power groups can improve any random access system, especially when the traffic is split between groups in an appropriate way, which we demonstrate. But since networks are hurt by destructive interference less than centralized systems are, it is harder to improve them. Using power groups can significantly improve centralized systems, but will lead to a smaller relative improvement in networks. Decomposing the system into a hierarchy of ALOHA levels, with only a small population contending at the top level, can improve centralized systems but does not improve networks.


Takagi, H. and L. Kleinrock, "Optimal Transmission Ranges for Randomly Distributed Packet Radio Terminals", IEEE Transactions on Communications, Vol. COM-32, No. 3, March 1984, pp. 246-257.

In multihop packet radio networks with randomly distributed terminals, the optimal transmission radii to maximize the expected progress of packets in desired directions are determined with a variety of transmission protocols and network configurations. It is shown that the FM capture phenomenon with slotted ALOHA greatly improves the expected progress over the system without capture, due to the more limited area of possibly interfering terminals around the receiver. The (mini)slotted nonpersistent carrier-sense-multiple-access (CSMA) only slightly outperforms ALOHA, unlike the single-hop case (where a large improvement is available), because of a large area of "hidden" terminals and the long vulnerable period generated by them. As an example of an inhomogeneous terminal distribution, the effect of a gap in an otherwise randomly distributed terminal population on the expected progress of packets crossing the gap is considered. In this case, the disadvantage of using a large transmission radius is demonstrated.


Takagi, H. and L. Kleinrock, "Diffusion Process Approximation for the Queueing Delay in Contention Packet Broadcasting Systems", 2nd Int'l Symposium on the Performance of Computer Communications Systems, IBM, Zurich, March 21-23, 1984.

The average packet delay (including queueing and randomized retransmission delays) for a finite number of random access users of a channel with infinite buffers is studied. For a class of contention-type memoryless protocols (including ALOHA and nonpersistent CSMA), a diffusion process approximation for the joint queue length distribution is formulated, and on the basis of its stationary solution, two approximate mean delay formulas are proposed and examined against simulation.


Kung, Kenneth Ching-Yu, "Concurrency in Parallel Processing Systems", Ph.D. Dissertation, Computer Science Department, University of California, Los Angeles, March 1984.
ABSTRACT OF THE DISSERTATION


Concurrency in Parallel Processing Systems
by

Kenneth Ching-Yu Kung
Doctor of Philosophy in Computer Science
University of California, Los Angeles, 1984
Professor Leonard Kleinrock, Chair

The idea of multiprocessing has been with us for many years. We would like to know, however, how much gain (i.e., speedup) is really achieved when multiple processors are used. In this dissertation, we model a computer job as a Directed Acyclic Graph (DAG), each node in the DAG representing a separate task that can be processed by any processor. Four parameters are used to characterize the concurrency problem, resulting in 16 cases. The four parameters are:

1. How the jobs arrive: either a fixed number of jobs at time zero or jobs arriving from a Poisson source;

2. The DAG: either the same for each job or each job randomly selecting its DAG;

3. The service time of each task: constant or exponentially distributed;

4. The number of processors: either a fixed number or an infinite number (an infinite number of processors meaning that whenever a task requires a processor, one will be available).

For all cases studied, we define a common concurrency measure which gives a comparison of how much parallelism can be achieved. The concurrency measure is obtained exactly for several cases by first converting the DAG into a Markov chain where each state represents a possible set of tasks that can be executed in parallel. From this Markov chain, and by utilizing a special feature of the chain, we are able to find the equilibrium probabilities of each state and the average time required to process a single job.


We also find upper and lower bounds for the concurrency measure for certain cases studied. The upper bound is found by synchronizing the execution at various places in the DAG.

We present two algorithms for assigning the tasks to processors. One algorithm minimizes the expected time to complete all jobs, while the other maximizes the utilization of the processors.

The communication cost between any two tasks that reside on different processors is modeled as a task. We study the effect of the communication costs on the gains that are achieved from multiprocessing.


CHAPTER 1
Introduction

1.1 Distributed Processing in a Network of Processors

Central processing units have been the backbone of computing centers for many years. These machines are generally very powerful but also very expensive. Communication networks transfer data among these central processors so that the processing power of several processors may be combined and the processing resources may be shared with users at other sites. But researchers recognize that, even though the processing capabilities of each machine are shared by all users, the large communication time between hosts in comparison with memory access times often precludes the parallel execution of the same job on more than one machine if the networks are slow and/or costly. Thus processors connected by this type of communication network are often only loosely coupled to each other.

Many applications, however, require high speed capabilities not achievable with a single serial processor. In areas such as meteorology, cryptography, image processing, and sonar and radar surveillance [HAYN82, POTT83, ROSE83], the quality of the answer a processor returns is proportional to the amount of computation performed. There are only two avenues to improve the performance. One is to speed up the processor by having faster circuits, reducing the logic levels, reducing the cycle time per operation, using high speed algorithms, and having better storage organization. The other is to handle more than one task within a job simultaneously. The latter is the direction taken by the Japanese Fifth Generation Computer project.

But despite the impressive speed of many of the latest model computers, their basic architecture limits them to being serial machines and hinders their usefulness for computationally intensive problems. With the recent advances in the design and fabrication of VLSI circuits, a computing center consisting of up to tens of thousands of computing elements can be built. If we can decompose a large problem into many small concurrently executable tasks and allow several processors to work on them in parallel, we can achieve processing speeds not attainable by serial machines.


Of course, there is the complexity of multiprogramming and the low utilization associated with a processing center with so many processors.

Distributed processing can be defined as an architecture that has no master/slave relations among a set of processors. Instead, all processors are equal and each can access any network resource without interference from centralized controllers [PARR83].

Each processor in a distributed processing network, therefore, needs the same basic software tools:

- the operating system software
- the application software
- the database access method and query language
- a dictionary defining the location and structure of the data in the network
- a directory defining the structure of the network
- a standard message protocol.

In order to distribute the processing of one function among various machines, these processors must be connected in some fashion. Even though there is still some communication delay, the delay between processors in a locally interconnected switch is much smaller than that in the long-haul communication networks described before.

Many issues are involved in distributed processing among a network of processors. In particular, the following set of problems must be addressed:

1. efficient multi-access communication protocols

2. management of the databases: centralized or distributed

3. network management: file directories [POPE81], network resource directories

4. security and privacy [SCHE83]

5. reliability [AVIZ81, MAKA81, NG80]

6. topology [UPFA82]

7. scheduling

8. languages for parallel processing [HOLT78, SCHU81, HASE75, HASE77]

9. concurrency in processing jobs

In our research we concentrate on the last item in the above list. Concurrency within jobs is not very well understood because machines have often been used in a serial fashion, and therefore the possibility of parallel execution within a single job has not been extensively explored.

If we have a large number of processors and these computing elements can be organized in such a way that they can cooperatively solve a single problem or attack many problems simultaneously, tremendous speed improvement can be realized. We recognize that in addition to the service time of jobs there is overhead associated with the organization of these processors. But this overhead is limited to the organization of the processors assigned to the jobs rather than the organization of all processors. Besides the advantage of speed, this type of system offers a distributed processing environment with increased reliability, availability, expandability and better utilization of resources.

In order to take full advantage of these cooperating processors, we must understand the parallelism within computer jobs and systems. We wish to find out just how much speedup is achievable, how these processors should actually be coordinated, and whether the communication between the processors is too costly for distributed processing. At the same time, the development of programming languages for parallel processing must proceed at a faster pace. Most of the existing languages do not allow parallel processing, so picking out the concurrency in programs written in them requires extensive preprocessing. Since most algorithms are not inherently sequential, once languages for parallel processing are available, it will be easier to produce programs for the multiprocessor environment.


1.2 Existing Examples

The idea of performing more than one operation simultaneously is at least 140 years old [KUCK77]. In an October 1842 publication, Menabrea describes Babbage's lecture [MORR61]:
    "... when a long series of identical computations is to
    be performed, such as those required for the formation of
    numerical tables, the machine can be brought into play so as to
    give several results at the same time, which will greatly abridge
    the whole amount of the processes."


So, clearly, the idea of parallel processing has been around for quite a while.

Since the early 1960's, there have been many attempts to speed up execution by giving the hardware some multioperational capability. The IBM 360/91 is a pipeline machine which operates on arrays of data. ILLIAC IV [BARN68] was a parallel array machine with 64 processing elements. The CRAY-1 was designed specifically for vector array processing [KOZD80]. The CRAY-1's Fortran compiler is designed to give the scientific user immediate access to the benefits of the CRAY-1's vector processing architecture. This compiler vectorizes the innermost DO loops so that they can be executed in parallel. [ENSL77] contains a chronological list of multiprocessors and parallel systems for the years from 1958 to 1977.

A good example of using ILLIAC IV is finding the solution to partial differential equations. The finite difference method defines the problem on a coordinate system with given boundary values. Each grid point then uses the weighted average of the values at its neighboring grid points to find its own value. Since each grid point at each iteration can be processed concurrently, we can define each grid point at each iteration to be a separate task. Usually several of these tasks are assigned to one processor. A solution is found when the difference between the values of each point in two consecutive iterations is smaller than a predetermined value, or else the solution is not obtainable because of instability.
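The grid-point scheme described above can be sketched as a Jacobi iteration. This is a minimal sequential Python sketch, assuming Laplace's equation on a rectangular grid with the standard five-point stencil (equal weights of 1/4); the function name `jacobi` and the parameters `tol` and `max_iter` are illustrative choices, not taken from the ILLIAC IV work. Within one sweep every interior update reads only the previous iterate, so each grid point is an independent task that could be assigned to a separate processing element.

```python
def jacobi(grid, tol=1e-6, max_iter=10_000):
    """Jacobi iteration for Laplace's equation (illustrative sketch).

    grid: list of rows; boundary values stay fixed and interior values are
    initial guesses. Returns (solution, sweeps). Assumes >= 3 rows and columns.
    """
    n, m = len(grid), len(grid[0])
    u = [row[:] for row in grid]
    for sweep in range(1, max_iter + 1):
        new = [row[:] for row in u]
        # Each interior point depends only on the previous iterate u,
        # so all of these updates could run concurrently.
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1])
        change = max(abs(new[i][j] - u[i][j])
                     for i in range(1, n - 1) for j in range(1, m - 1))
        u = new
        if change < tol:  # stop once consecutive iterates agree
            return u, sweep
    raise RuntimeError("no convergence within max_iter sweeps")
```

With every boundary value set to 1 and a zero initial interior, the iteration converges to the constant solution 1. A real array machine would split the interior into blocks, one block per processing element, matching the remark that several grid-point tasks are usually assigned to one processor.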


1.3 Summary of Results

In Chapter 3, the concurrency problem is modeled by four parameters:

1. How the jobs arrive: either a fixed number of jobs at time zero (k) or jobs arriving from a Poisson source (λ)

2. The DAG: either the same for each job (G) or each job randomly selecting its DAG (G*)

3. Service time of each task: constant (x) or exponentially distributed (x*)

4. The number of processors: either a fixed number (P) or an infinite number (P = ∞)


A process graph is defined (in Section 3.2) as a directed acyclic graph where the nodes represent the tasks within a job and the edges represent the precedence relationships among the tasks.

We use the shorthand notation "α, β, γ, δ", where α, β, γ, and δ represent how the jobs arrive, the type of DAG, the service time of each task, and the number of processors, respectively. For example, k, G, x, P = ∞ is shorthand notation for a system with a fixed number of jobs at time zero, a fixed process graph, a constant service time for each task and an infinite number of processors.

From these four parameters, we have sixteen separate cases, as shown in Figure 3.6. Besides the two trivial cases (k, G, x, P = ∞ and λ, G, x, P = ∞) discussed in Section 3.5, the case of fixed process graphs is discussed in Chapter 4, and in Chapter 5 we discuss random process graphs. For all the cases studied, we look for a common parameter, the concurrency measure σ, which is defined (in Section 3.3) as the average time a single job spends in the system using P processors divided by the average time a job spends in the system when only one processor is used.


1.3.1 Fixed Process Graphs

For the k, G, x*, P = ∞ case (Section 4.2), we develop a method for finding the average system time. Because the number of processors is assumed to be infinite, the results obtained are independent of the number of jobs, k. The process graph G is first converted into a Markov chain by Algorithm CPM (Section 4.2.1.1), where each state represents a possible set of tasks that can be processed in parallel. Since we know the rates out of and into each state, we have a set of balance equations. If we put these balance equations in matrix form, we have a lower triangular matrix, which can be inverted easily to obtain the equilibrium state probabilities. From these equilibrium probabilities, we can find the average system time and the concurrency measure σ.
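The state construction described above can be sketched in Python. This is an illustrative reconstruction of the idea, not a reproduction of Algorithm CPM: the function names `active_tasks` and `cpm_chain`, the dictionary-of-predecessors input format and the unit default mean service time are assumptions. A state is taken to be the set of completed tasks; the tasks running in that state are exactly those whose predecessors have all finished, and with exponential service each running task is equally likely to finish first. Because every transition adds one completed task, the states can be processed in order of size, which is the forward-substitution counterpart of solving the lower triangular system of balance equations.

```python
from collections import defaultdict


def active_tasks(done, preds):
    """Tasks not yet finished whose predecessors have all finished."""
    return [t for t in preds if t not in done and preds[t] <= done]


def cpm_chain(preds, mean_service=1.0):
    """Exact mean system time of one job with P = infinity (sketch).

    preds maps each task to the frozenset of its predecessors (a DAG);
    every task has an exponential service time with mean `mean_service`.
    Returns (mean_time, speedup, concurrency_distribution).
    """
    all_tasks = frozenset(preds)
    visit = {frozenset(): 1.0}  # probability the job passes through a state
    width = {}                  # number of concurrently running tasks per state
    frontier = [frozenset()]
    while frontier:
        nxt = set()
        for done in frontier:
            if done == all_tasks:
                continue            # job complete
            act = active_tasks(done, preds)
            width[done] = len(act)
            for t in act:           # each running task finishes first w.p. 1/len(act)
                new = done | {t}
                visit[new] = visit.get(new, 0.0) + visit[done] / len(act)
                nxt.add(new)
        frontier = list(nxt)
    # Expected time in a state = visit probability * mean holding time.
    sojourn = {s: visit[s] * mean_service / w for s, w in width.items()}
    mean_time = sum(sojourn.values())
    # Long-run fraction of time at each concurrency level, viewing the chain
    # as restarting with a new job each time one completes.
    concurrency = defaultdict(float)
    for s, t in sojourn.items():
        concurrency[width[s]] += t / mean_time
    speedup = len(all_tasks) * mean_service / mean_time
    return mean_time, speedup, dict(concurrency)
```

For the four-task diamond a → {b, c} → d with unit mean service, this gives a mean system time of 1 + 1.5 + 1 = 3.5, a speedup of 4/3.5 ≈ 1.14, and two tasks running concurrently 1/7 of the time. The number of states equals the number of precedence-closed task subsets, so the method suits small or narrow graphs.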

Bounds on the concurrency measure are easier to obtain than its exact value. An upper bound can be found by forcing the execution of a job to synchronize at each level of the process graph: no task in a level can start execution until all the tasks in the previous level have completed execution. We first study the average time required for a node to wait for all its predecessors in the previous level to complete. In each level there exists a node which has the maximum number of edges entering it. Therefore, no task in this level can start executing until all the predecessors of this task have been completed. By summing the time required at each level to process the task with the maximum in-degree, we obtain an upper bound on the average system time.


A lower bound is simply the average processing time of a task multiplied by the number of tasks in the longest path from the initial node to the terminating node.
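Both kinds of bound can be computed from level assignments alone. The sketch below is hedged: it uses the simplest level-synchronized bound, in which every task of a level must finish before any task of the next level starts, rather than the finer in-degree argument described above. With exponential service, a level of n tasks then takes the mean service time multiplied by the harmonic number H_n on average. Synchronization can only delay tasks, so this is a valid (if looser) upper bound; the lower bound is the longest-path rule stated above. The names `levels`, `harmonic` and `system_time_bounds` are hypothetical.

```python
def levels(preds):
    """Level of each task: 1 + the length of its longest chain of predecessors."""
    level = {}

    def lvl(t):
        if t not in level:
            level[t] = 1 + max((lvl(p) for p in preds[t]), default=0)
        return level[t]

    for t in preds:
        lvl(t)
    return level


def harmonic(n):
    return sum(1.0 / i for i in range(1, n + 1))


def system_time_bounds(preds, mean_service=1.0):
    """(lower, upper) bounds on mean system time, exponential service, P = infinity."""
    level = levels(preds)
    depth = max(level.values())
    widths = [0] * depth
    for l in level.values():
        widths[l - 1] += 1
    # Lower bound: the longest path has `depth` tasks executed in series.
    lower = depth * mean_service
    # Upper bound: synchronizing at every level can only delay tasks; a level
    # of n tasks then lasts E[max of n exponentials] = mean_service * H_n.
    upper = mean_service * sum(harmonic(n) for n in widths)
    return lower, upper
```

For the four-task diamond a → {b, c} → d with unit mean service, the bounds are 3 and 3.5; the upper bound is exact there because task d must wait for both b and c anyway.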

For the λ, G, x*, P = ∞ case (Section 4.3), the results obtained in Section 4.2 can be applied directly. With an infinite number of processors, once a job enters the system it is immediately served, and no waiting time is required.

We briefly discuss the case of k, G, x*, P < ∞ using the Stochastic Petri Nets model in Section 4.4. Any process graph can be converted into a Petri net. A 'place' representing the available processors is added to the Petri net such that at each 'transition', if a processor is needed, a 'token' is obtained from this place, and whenever a transition holding a processor token finishes, the token is returned to this place. By using the analysis provided by Stochastic Petri Net theory, we can find the average utilization of the processors.
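The role of the processor place can be illustrated without the Stochastic Petri Net machinery. The sketch below uses a swapped-in technique, a Monte Carlo simulation rather than SPN analysis: a counter of free processor tokens is decremented when a task starts and incremented when it finishes, service times are exponential, and ready tasks are served first-come first-served. The function name `simulate_job` and the FCFS rule are assumptions for illustration, and only a single job (k = 1) is simulated.

```python
import heapq
import random


def simulate_job(preds, processors, mean_service=1.0, rng=None):
    """Simulate one job with a pool of processor tokens; return (makespan, utilization).

    preds: task -> iterable of predecessor tasks (a DAG). Task names must be
    mutually comparable (e.g. all strings) because they break heap ties.
    Requires processors >= 1.
    """
    rng = rng or random.Random()
    waiting_on = {t: set(ps) for t, ps in preds.items()}
    succs = {t: [] for t in preds}
    for t, ps in preds.items():
        for p in ps:
            succs[p].append(t)
    ready = [t for t, ps in waiting_on.items() if not ps]
    tokens = processors           # free tokens in the "processor place"
    running = []                  # heap of (finish_time, task)
    now = busy = 0.0
    while ready or running:
        while ready and tokens:   # a ready task takes a token and starts
            t = ready.pop(0)      # first-come first-served (assumed rule)
            d = rng.expovariate(1.0 / mean_service)
            busy += d
            tokens -= 1
            heapq.heappush(running, (now + d, t))
        now, t = heapq.heappop(running)
        tokens += 1               # token returned when the task finishes
        for s in succs[t]:
            waiting_on[s].discard(t)
            if not waiting_on[s]:
                ready.append(s)
    return now, busy / (processors * now)
```

With one processor the utilization is 1 by construction. Averaging the makespan over many runs with four processors on the diamond a → {b, c} → d should come close to the infinite-processor value of 3.5, since that graph never has more than two tasks ready at once.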

In Section 4.5, we study the assignment problem for the case of k, G, x, P < ∞. Two scheduling algorithms are analyzed for diamond-shaped process graphs: one algorithm gives the worst-case assignment and the other gives the best-case assignment. By studying the ratio of the average system time using the worst algorithm to that using the best algorithm, we find that this ratio is not large (less than two). Therefore, if we allow random assignment (an available processor is given to any task that is ready to execute), the resulting average system time will fall between the two boundary values.


1.3.2 Random Process Graphs

In Chapter 5 we look at some of the properties of random process graphs. We find (in Section 5.2.3) that the number of arrangements for N tasks with respect to the number of levels may be approximated by a Gaussian distribution (recognizing that this approximation permits a negative number of levels, which is clearly impossible). Since no arrangement can have fewer than one level, we set the probability of any arrangement with fewer than one level equal to zero. The distribution is centered near $N/2$, so most arrangements have approximately $N/2$ levels as N becomes large. Using the Chernoff bound, the tail probability of this distribution is found in Section 5.2.4.
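The Gaussian shape can be checked numerically under one concrete reading of "arrangement", which is our assumption rather than the dissertation's definition: an arrangement is an ordered sequence of non-empty level sizes summing to N (a composition of N). There are then C(N-1, L-1) arrangements with L levels and 2^(N-1) in total, so the number of levels behaves like 1 + Binomial(N-1, 1/2), with mean (N+1)/2 and variance (N-1)/4. The tail bound used below is the Chernoff-Hoeffding form exp(-2t^2/(N-1)); all function names are hypothetical.

```python
from math import comb, exp, pi, sqrt


def level_counts(n):
    """Arrangements of n tasks into L non-empty ordered levels, for L = 1..n."""
    return {L: comb(n - 1, L - 1) for L in range(1, n + 1)}


def gaussian_approx(n, L):
    """Normal approximation to the count, mean (n+1)/2, variance (n-1)/4 (n >= 2)."""
    mean, var = (n + 1) / 2, (n - 1) / 4
    return 2 ** (n - 1) * exp(-((L - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def upper_tail(n, L):
    """Exact fraction of arrangements having at least L levels."""
    return sum(c for l, c in level_counts(n).items() if l >= L) / 2 ** (n - 1)


def chernoff_tail(n, L):
    """Chernoff-Hoeffding bound on upper_tail(n, L), valid for L >= (n + 1) / 2."""
    t = (L - 1) - (n - 1) / 2
    return exp(-2 * t * t / (n - 1))
```

For N = 21 the counts peak at L = 11, the Gaussian value there is within about 1.3% of the exact count, and the exact fraction with at least 15 levels (about 0.058) lies below its Chernoff bound (about 0.20).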


In the case k, G*, x, P = ∞, where N is the number of tasks within a job, we find that as N approaches infinity the average system time is asymptotically $N/2$ multiplied by the average task service time, and the concurrency measure therefore approaches a limit independent of N, corresponding to the speedup of 2 quoted earlier in this report.
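Under the same composition reading of "arrangement" (an assumption, not the dissertation's definition) and constant service time x with P = ∞, a job with L levels finishes in exactly L·x, because every task starts as soon as the longest chain of its predecessors has finished. Averaging over all arrangements gives a mean of (N+1)/2 levels, so the speedup is N / ((N+1)/2) = 2N/(N+1), which tends to 2. The helper names below are hypothetical; exact fractions avoid rounding.

```python
from fractions import Fraction
from math import comb


def mean_levels(n):
    """Average number of levels over all 2**(n-1) arrangements of n tasks."""
    total = sum(L * comb(n - 1, L - 1) for L in range(1, n + 1))
    return Fraction(total, 2 ** (n - 1))


def speedup_constant_service(n):
    """One-processor time n*x divided by the mean P = infinity time (levels * x)."""
    return Fraction(n) / mean_levels(n)
```

For example, N = 5 gives 3 levels on average and a speedup of 5/3, while N = 99 gives 99/50, already close to the limit of 2.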

For the case k, G*, x*, P = ∞, we found and proved an arrangement that provides the upper bound on system time over all arrangements with N nodes. An upper bound is presented in Section 5.4.1.1, while a lower bound is presented in Section 5.4.1.2. Both bounds are expressed probabilistically; they are approximately

$$\frac{N}{2}\,\bar{x} \;\lesssim\; S \;\lesssim\; \frac{3N}{4}\,\bar{x},$$

where $S$ is the average system time of a job, $N$ is the number of tasks within a job and $\bar{x}$ is the average service time of a single task.


If the number of precedence relationships is also given, then we define the minimally connected process graph in Section 5.4.2.1. $M_e$ is defined to be the minimum number of edges required to fix all the nodes of a particular process graph at their proper levels. We also give expressions for $\max M_e$ and $\min M_e$ for any process graph with N nodes and L levels. Using this concept, we give an upper bound in Section 5.4.2.2. Section 5.4.3 then compares the two upper bounds obtained in Sections 5.4.1.1 and 5.4.2.2.

In Section 5.5, we try to find the optimal number of processors that a diamond-shaped process graph in the case k, G*, x, P < ∞ will require such that a function called power is maximized. Power is defined to be the average utilization of the processors divided by the normalized average system time [KLEI79]. An expression is provided for the optimal number of processors per job.

Section 5.6 briefly discusses two loose bounds for the average system time for the case (k, G*, x, P < ∞).


1.3.3  Communication  Overhead 


For the case (k, G, x*, P = ∞), we study the effects of the communication overhead between processors. We add a communication task between any two neighboring tasks in G that do not reside on the same processor. The average time for the communication tasks is expressed as a multiple 'a' of the average service time for a regular task. Using the same technique presented in Section 4.2, we find the resulting average system time as a function of
In Section 6.2, we limit the number of processors to P, and study the effects of the communication overhead with various values of P on a process graph. By varying the parameter 'a', we obtain a family of curves for the average system time versus the number of processors.
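The graph transformation used in the communication-overhead study (a communication task inserted between neighboring tasks that sit on different processors) can be sketched as follows. This is a minimal illustration under stated assumptions: the process graph is given as an edge list, the assignment as a dict from task to processor, and the function name `add_communication_tasks` and the node naming `("comm", i, j)` are hypothetical. Each such communication task is given mean time a times the mean regular service time.

```python
def add_communication_tasks(edges, assignment, mean_service, a):
    """Return (new_edges, comm_times) for the graph with communication tasks added.

    edges: list of (i, j) precedence pairs (task i must finish before task j).
    assignment: dict mapping each task to a processor id.
    mean_service: mean service time of a regular task.
    a: mean communication time as a multiple of mean_service.
    """
    new_edges = []
    comm_times = {}
    for i, j in edges:
        if assignment[i] == assignment[j]:
            # Same processor: the precedence edge is kept unchanged.
            new_edges.append((i, j))
        else:
            # Different processors: route the precedence through a new
            # communication task c, so j waits for both i and c.
            c = ("comm", i, j)
            comm_times[c] = a * mean_service
            new_edges.append((i, c))
            new_edges.append((c, j))
    return new_edges, comm_times


# Diamond-shaped example: 1 -> {2, 3} -> 4, with task 3 on a second processor.
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
assignment = {1: 0, 2: 0, 3: 1, 4: 0}
new_edges, comm = add_communication_tasks(edges, assignment, mean_service=1.0, a=0.5)
print(len(new_edges), len(comm))  # 6 2
```

The resulting graph can then be analyzed exactly like any other process graph, with the communication tasks treated as ordinary nodes.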

In  Section  6.3,  we  put  a  further  condition  on  the  communication  overhead  by  allowing 
only  one  communication  bus.  Thus,  only  one  communication  task  may  be  transmitting  at  any 
particular  instant.  We  modify  the  technique  of  Section  4.2  for  the  analysis  required  in  this  sec- 


CHAPTER  2 

Background  and  Related  Work 

2.1 Brief History

In the late 1950s, relatively few people knew how to work with computers, and an entire computer was dedicated to one person at a time. If one needed to use the computer, he would reserve a time slot, and the machine would be dedicated to him during that period. Each user waited for an empty time slot in order to use the machine (i.e., the sign-up sheet was the scheduler). The advantage of this arrangement was that each user had the full processing power of the machine while he was using it. The drawback, on the other hand, was the low utilization of the processor, since users spent a great deal of time thinking.

As  the  machine  became  faster  and  more  costly,  it  was  not  economically  feasible  to 
maintain  the  previous  arrangement.  To  utilize  the  machine  more  efficiently,  users  were  required 
to  punch  their  computer  jobs  on  cards  and  submit  them  to  a  computer  operator.  The  computer 
operator  would  then  schedule  jobs  by  putting  the  cards  of  different  jobs  into  the  card  reader  in 
some  predefined  order.  In  this  manner,  the  computer  was  kept  busy  most  of  the  time,  but  the 
turnaround  time  for  jobs  could  not  be  predicted  and  could  vary  from  several  hours  to  several 
days. 

The  next  step  was  a  compromise  between  the  two  extremes  of  either  low  utilization  but 
dedicated  machine  or  high  utilization  but  long  turnaround  time.  Operating  systems  were 
developed  so  several  jobs  could  share  the  processing  facilities  of  the  system  at  the  same  time. 
Priorities  were  given  to  jobs  so  that  interactive  users  who  required  small  amounts  of  processing 
time  had  the  highest  priorities  and  long  batch  jobs  had  the  lowest  priorities.  Because  the 
memory  was  also  shared  among  the  users,  paging  and  the  technique  of  virtual  memory  were 
developed.  Various  methods  of  sharing  the  processor  were  studied  and  used  to  increase  the 
throughput  and  to  lower  the  average  waiting  time  of  jobs. 

An expensive and powerful central processing unit has several drawbacks. The overhead of the operating system software in controlling a multiprogramming environment is high. Jobs are constantly being swapped in and out of the high-speed memory. Such operating systems are also very cumbersome, as witnessed by the fact that there are still errors in the IBM/MVS operating system, despite many releases.


The central processor also has problems of reliability and availability. When the processor in a single-processor machine goes down, the entire machine is unavailable to users. A few computer manufacturers have tried to solve this problem by introducing fault-tolerant machines (e.g., the NONSTOP MACHINE™ by TANDEM). Such a machine consists of several processors and memories, with the operation of each processor backed up by another processor; whenever one processor goes down, its twin processor starts up right away.

This leads to the multi-processor environment that we are addressing. As computer jobs enter the system, they are processed by one or more identical processors. Thus, the system keeps running even when several processors are unavailable, and it is more reliable, since any processor can process any of the jobs. As the price of VLSI components keeps dropping, it will be possible to build a computer center consisting of large numbers of processors. There are many difficult issues related to multi-processors, as mentioned in Chapter 1, but we would like to explore the concurrency within jobs and to take advantage of multi-processing to improve system performance.


2.2  Graph  Model  of  Behavior 

Our representation of jobs, as described in Chapter 3 and used throughout the later chapters, is similar to the UCLA Graph Model of Behavior (GMB), which uses a control graph and an associated data graph. The control graph is, essentially, a variation of the Petri Net in which the edges represent conditions and the circles represent transitions. Logic expressions are assigned to the sets of both input and output edges. These expressions are made up of 'and' and 'or' logic. Computation is simulated by the movement of tokens from edges through nodes to edges. The logic expressions determine from which input edges the tokens are removed and to which edges the tokens are delivered. Each processor is associated with one or more operations in the control graph. When an operation is initiated by the control graph, the processor associated with that operation executes its procedure. For each operation, the data graph provides the locations of both the input data and the stored output data. After the control graph has been determined, analysis of the GMB is carried out by simulation. This requirement for simulation places a large overhead on the analysis of each problem. Part of our work is based on a whole class of jobs instead of an individual job; hence, it is easier to obtain system parameters and to generalize to a large number of jobs.


In a series of works on GMB, a sequence of researchers [ESTR63, MART66, MART67a, MART67b, MART67c, MART69, BAER68, BOVE68, RUSS69] associated various types of input and output logic with each node of a directed graph model, identified the random variables which arise as a result of the application of programs to different sets of input data, and applied the model to the evaluation of the effectiveness of parallel processor systems. In the course of these studies, they evolved algorithms for the transformation of cyclic graphs to acyclic graphs, maintaining equivalence of the graphs in the sense that mean path length is preserved for any transformed cycle [MART67b]. They developed effective algorithms to calculate the probability of ever reaching a given node in the graph [BAER70] and formed upper and lower bounds on the number of processors required for maximum parallelism [MART69, BAER69]. Fernandez [FERN72] transformed the acyclic graphs by adding precedence relationships in such a way that the execution time of the resulting graph does not change while the minimum number of processors is used. Using GMB, Ramamoorthy [RAMA72] also scheduled the tasks such that the total execution time is minimized, and obtained the minimum number of processors required to realize this schedule.


2.3  Related  Work 
2.3.1  Petri  Nets 

Petri Nets [PETE81] were designed to model systems with interacting concurrent components. They are widely used in the area of software verification. By modeling a program with a Petri Net and generating all possible 'markings,' we can detect the existence of deadlocks. By themselves, though, Petri Nets ignore the random time duration between the firings of two transitions, i.e., the time interval between two markings of a Petri Net. In [RAMA80], a constant time unit is associated with each transition. The performance is measured by finding the minimum cycle time, which is the time required to process a job. Molloy [MOLL81] introduced the Stochastic Petri Net (SPN), in which a random variable representing the firing delay is assigned to each 'transition.' Each marking in the Stochastic Petri Net, which represents a set of concurrently active tasks, is associated with a state in a Markov chain. By solving for the state probabilities in this Markov chain, we can obtain the density of 'tokens' in each place or each marking. In Chapter 4 we will show how this model can assist us in solving for some system parameters.
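Deadlock detection by generating all reachable markings can be illustrated with a small sketch. It assumes an ordinary Petri net in which firing a transition removes one token from each input place and adds one token to each output place; the net itself (two processes acquiring two shared resources in opposite order) is a hypothetical example, not one from the cited works. A reachable marking in which no transition is enabled, other than normal termination, is reported as a deadlock.

```python
from collections import deque

PLACES = ("idle1", "hold1A", "done1", "idle2", "hold2B", "done2", "A", "B")
IDX = {p: k for k, p in enumerate(PLACES)}

# Transition name -> (input places, output places). Process 1 takes A then B;
# process 2 takes B then A; each releases both resources when it finishes.
TRANSITIONS = {
    "t1_takeA": (("idle1", "A"), ("hold1A",)),
    "t1_takeB": (("hold1A", "B"), ("done1", "A", "B")),
    "t2_takeB": (("idle2", "B"), ("hold2B",)),
    "t2_takeA": (("hold2B", "A"), ("done2", "A", "B")),
}

def enabled(marking, inputs):
    return all(marking[IDX[p]] >= 1 for p in inputs)

def fire(marking, inputs, outputs):
    m = list(marking)
    for p in inputs:
        m[IDX[p]] -= 1
    for p in outputs:
        m[IDX[p]] += 1
    return tuple(m)

def reachable_markings(initial):
    """Breadth-first enumeration of every marking reachable from `initial`."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        m = queue.popleft()
        for inputs, outputs in TRANSITIONS.values():
            if enabled(m, inputs):
                nxt = fire(m, inputs, outputs)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen

def deadlocks(initial):
    """Dead markings (no transition enabled) other than normal termination."""
    result = []
    for m in reachable_markings(initial):
        dead = not any(enabled(m, ins) for ins, _ in TRANSITIONS.values())
        finished = m[IDX["done1"]] == 1 and m[IDX["done2"]] == 1
        if dead and not finished:
            result.append({p: m[IDX[p]] for p in PLACES if m[IDX[p]]})
    return result

initial = tuple(1 if p in ("idle1", "idle2", "A", "B") else 0 for p in PLACES)
print(deadlocks(initial))  # [{'hold1A': 1, 'hold2B': 1}]
```

The exhaustive enumeration is also what makes the state space grow so quickly as tokens are added, which is the difficulty discussed next for the SPN.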


A disadvantage of the SPN is that each marking reachable from the initial marking is a state in the Markov chain. As the number of tokens increases, the number of states in the Markov chain grows at an even faster rate, making the SPN analysis very difficult. For this reason, the SPN cannot model a system with open arrivals, where the number of jobs is undetermined or the number of tokens and states is unlimited. Even when analysis is possible, the result obtained for one specific job cannot be generalized to other jobs.


2.3.2  Automatic  Detection  of  Parallelism 

Parallelism  in  programs  may  be  either  explicit  or  implicit.  Explicit  parallelism  is 
specifically  indicated  by  programming  features  such  as  COBEGIN/COEND. 

Implicit parallelism is the parallelism that exists in the algorithm but is not explicitly stated. Some common techniques used by compilers for detecting implicit parallelism are:

1.  Loop  Distribution 

Sometimes the statements within a loop may be executed in parallel. This idea was introduced by Muraoka [MURA71] and was later implemented by Kuck et al. [KUCK72, KUCK74] in their FORTRAN program analyzer to measure potential parallelism in ordinary programs.

2. Tree Height Reduction

By making use of the associative, commutative and distributive properties, compilers may detect implicit parallelism in algebraic expressions and produce object code for multiprocessors. For example,

    ((a+b)+c)+d

can be replaced by

    (a+b)+(c+d)

and

    a*(b*c*d+e)

can be replaced by

    a*b*c*d + a*e

Assuming that only associativity and commutativity are used to transform expressions, Baer and Bovet [BAER68] gave a comprehensive tree-height reduction algorithm. Later, Beatty [BEAT72] showed the optimality of this method.
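The benefit of these transformations can be checked by computing the height of the expression tree, which is the number of parallel steps needed under the usual idealization that each binary operation takes one step and enough processors are available. The sketch below uses Python's standard `ast` module to parse each form; `tree_height` is a hypothetical helper that counts only binary-operator levels. Note that the distributive rewrite pays off only when the product a*b*c*d is also regrouped by associativity, as in the last line.

```python
import ast

def tree_height(expr: str) -> int:
    """Number of binary-operation levels in the parse tree of `expr`."""
    def height(node: ast.AST) -> int:
        if isinstance(node, ast.BinOp):
            return 1 + max(height(node.left), height(node.right))
        return 0  # a variable or constant is a leaf
    return height(ast.parse(expr, mode="eval").body)

print(tree_height("((a+b)+c)+d"))        # 3: three sequential additions
print(tree_height("(a+b)+(c+d)"))        # 2: a+b and c+d can run concurrently
print(tree_height("a*(b*c*d+e)"))        # 4
print(tree_height("(a*b)*(c*d) + a*e"))  # 3
```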

In [RAMA69], a survey of techniques for recognizing parallel processable streams in FORTRAN programs was presented. These algorithms are primarily concerned with the detection of parallelism within arithmetic expressions. The problem of protecting common data was also recognized: if two tasks are executed in parallel and both access the same data cell, then different orders of execution may produce different answers.

Russell [RUSS69] developed an interactive system in which a graphical display of potential parallelism in FORTRAN programs, together with detected bottlenecks, is presented to the user for further analysis.

In [KUCK77], the possible speedup of FORTRAN programs was also studied. Three levels of parallelism were discussed:

1.  Parallelism  within  a  line  of  code 

This  referred  to  the  reduction  of  the  tree  height  in  an  arithmetic  expression. 

2.  Parallelism  within  a  program 

Concurrent  execution  of  the  loops  in  a  program  was  explored. 

3.  Parallelism  within  the  hardware 

The  hardware  organization  of  pipeline  and  array  processors  was  discussed. 

Several specific FORTRAN programs were analyzed [KUCK72, KUCK74]. The authors showed the speedup of the programs and the efficiency and utilization of processors for each of the programs. All of the programs achieved some speedup, most of them by a large amount. The utilization is, as expected, quite low. Most interestingly, however, they conclude that, as the number of processors increases, the speedup is greater than the logarithm of the number of processors predicted by Amdahl in [AMDA67].


2.3.3 Multiprocessor Hardware Organization

One  of  the  problems  in  the  design  of  a  multiprocessor  system  is  determining  the  means 
of  connecting  the  multiple  processors  and  the  I/O  processors  to  the  storage  units. 

The  four  common  multiprocessor  system  organizations  are: 

1.  Bus 

The bus organization uses a single communication path (such as Ethernet) between all functional units: processors, storage units and I/O processors. Multi-access protocols are required to share this common transmission medium.

2.  Crossbar-switch 

In  this  organization,  there  is  a  separate  path  to  every  storage  unit.  The  hardware  must 
be  capable  of  resolving  conflicts  within  the  same  storage  unit. 

3. Shuffle/Exchange [THAN81]

In the Shuffle/Exchange network, there are log2 N columns of routing switches connecting N processors to N memory modules. Each column consists of N/2 two-input, two-output switches. Figure 2 shows an example of this organization with N = 8.

4. Hierarchical [THAN81]

A  hierarchy  is  imposed  on  the  set  of  processors  and  memory  units.  In  such  a  structure, 
each  processor  has  immediate  access  only  to  part  of  the  system  memory.  Any  reference 
to  remaining  memory  must  be  handled  by  a  higher  level  processor. 

Two examples of this organization are C-mmp and Cm*. C-mmp [SIEW78a, SIEW78b, WULF80] is a 16-processor system consisting of PDP-11/40 minicomputers. The processors share 16 storage modules through a crossbar-switch matrix. Cm* [SIEW78a, SIEW78b, HAYN82b] consists of 50 LSI-11 microprocessors. It is constructed from processor-storage pairs called computer modules, each of which is referred to as a Cm. Cm's are grouped into clusters, and clusters are connected by intercluster busses.
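For the Shuffle/Exchange organization, the wiring between switch columns is commonly described by the perfect-shuffle permutation, which for N = 2^n rotates the n-bit address of a line left by one bit; an exchange switch then optionally swaps two lines whose addresses differ only in the lowest bit. The sketch below assumes this textbook definition (it is not taken from [THAN81]) and routes a source to a destination through log2 N shuffle/exchange stages by setting the lowest bit at each stage to the next destination bit, most significant first. The size N = 8 is just the example used here.

```python
def shuffle(addr: int, n: int) -> int:
    """Perfect shuffle on n-bit addresses: rotate left by one bit."""
    msb = (addr >> (n - 1)) & 1
    return ((addr << 1) & ((1 << n) - 1)) | msb

def route(src: int, dst: int, n: int) -> list[int]:
    """Line addresses visited when routing src -> dst through n shuffle/exchange stages."""
    path = [src]
    addr = src
    for stage in range(n):
        addr = shuffle(addr, n)
        # Exchange switch: force the lowest bit to the next destination bit.
        bit = (dst >> (n - 1 - stage)) & 1
        addr = (addr & ~1) | bit
        path.append(addr)
    return path

N, n = 8, 3  # log2(8) = 3 switch columns, each with N/2 = 4 switches
print([shuffle(a, n) for a in range(N)])  # [0, 2, 4, 6, 1, 3, 5, 7]
print(route(5, 2, n))                     # [5, 2, 5, 2]
```

After n stages every source bit has been rotated out and replaced by a destination bit, which is why log2 N columns suffice to connect any processor to any memory module.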


2.3.4 Theory of Branching Processes

The theory of branching processes [HARR63] deals with the following problem. Initially there is one node; at each iteration i = 0, 1, 2, ..., each node on the current level independently produces k descendants with probability $p_k$. The theory determines the expected number of descendants at the i-th iteration and the probability that, after i iterations (for large i), there are no descendants left. The tree of descendants obtained is similar to the structured process graph described in Chapter 4. Since the number of descendants at each level is random, the resulting tree can also be thought of as a random process graph (defined in Section 3.3).

Given k nodes at the (i-1)-th iteration, the generating function of the number of descendants at the i-th iteration was found to be $[f(z)]^k$, where

$$f(z) = \sum_{j=0}^{\infty} p_j z^j$$

and z is the transform variable. The expected value and the variance of the number of nodes at the i-th iteration have also been found.
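These quantities are easy to compute numerically from the offspring distribution. As a sketch, assuming the offspring probabilities are given as a list `p` with `p[k]` the probability of k descendants (the values below are hypothetical), the expected number of nodes at iteration i is m^i with m = f'(1), and the probability that no nodes are left after i iterations is f_i(0), where f_i is f composed with itself i times. Its limit is the extinction probability, the smallest non-negative root of f(z) = z.

```python
def f(z, p):
    """Offspring generating function f(z) = sum_k p[k] * z**k."""
    return sum(pk * z**k for k, pk in enumerate(p))

def mean_offspring(p):
    """m = f'(1) = sum_k k * p[k]."""
    return sum(k * pk for k, pk in enumerate(p))

def prob_extinct_by(i, p):
    """P(no nodes left after i iterations) = f_i(0), with f composed i times."""
    q = 0.0
    for _ in range(i):
        q = f(q, p)
    return q

p = [0.25, 0.25, 0.5]  # hypothetical: 0, 1 or 2 descendants per node
m = mean_offspring(p)
print(m)                        # 1.25
print(m ** 3)                   # 1.953125 expected nodes at iteration 3
print(prob_extinct_by(200, p))  # approaches 0.5, the root of f(z) = z below 1
```

Here m > 1, yet the process still dies out with probability 1/2, which is the first difficulty noted below for modeling computer jobs.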

Two problems, however, prevent this model from representing the tasks in computer jobs. One is that the descendants may never die out, which corresponds to a computer job that never terminates. If the descendants do die out, the other problem is how to merge the tasks that have no descendants. This corresponds to the question of how, after individual tasks are completed, their results are to be combined to form the final solution.


2.3.5  Bounds  on  the  Average  System  Time 

In [ROBI79], bounds on the average system time of a tree-shaped process graph were obtained, using arguments similar to those that we have used in Section 4.3.

Since the process graph is in the form of a tree, there is a unique path from each task toward the root of the tree (the terminating task). Let $C_1, C_2, \ldots, C_n$ be all the paths from leaf tasks (i.e., tasks without any precedence relationships entering them) to the terminating task, and let $H_i$ be the set of all tasks at level i, for $1 \le i \le L$, where L is the number of levels in the tree. Assume that the number of processors is infinite and that each task $T_j$ has a random processing time $t_j$. Then the expected time to process this process graph, S, is bounded by

$$\max_{1 \le k \le n} \sum_{T_j \in C_k} E[t_j] \;\le\; S \;\le\; \sum_{i=1}^{L} E\Big[\max_{T_j \in H_i} t_j\Big]$$
In Section 4.2.2, we develop this bounding technique for general process graphs (using the concept of a "structured process graph") instead of restricting it to the special case of tree-shaped process graphs.
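The path and level bounds for a tree can be checked numerically. The sketch below is illustrative only: it assumes exponentially distributed task times with unit mean and a hypothetical seven-task in-tree (four leaves feeding two tasks that feed the root), and it uses Monte Carlo simulation, not the analytical method of [ROBI79]. With unlimited processors a task starts as soon as all of its inputs finish, so the completion time is the longest path; the lower bound is the largest expected path length and the upper bound is the sum over levels of the expected maximum task time in each level.

```python
import random

# Hypothetical in-tree with 7 tasks: tasks 3, 4 feed task 1; tasks 5, 6 feed task 2;
# tasks 1, 2 feed the root (terminating) task 0.
PREDECESSORS = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
LEVELS = [[3, 4, 5, 6], [1, 2], [0]]  # H_1 (leaves), H_2, H_3 (root)

def finish_time(task, t):
    """Completion time of `task` with unlimited processors ('and' precedence)."""
    start = max((finish_time(q, t) for q in PREDECESSORS.get(task, [])), default=0.0)
    return start + t[task]

def estimate(samples=50_000, seed=1):
    """Monte Carlo estimates of E[S] and of the level-by-level upper bound."""
    rng = random.Random(seed)
    total_s = total_upper = 0.0
    for _ in range(samples):
        t = {task: rng.expovariate(1.0) for task in range(7)}  # mean-1 task times
        total_s += finish_time(0, t)
        total_upper += sum(max(t[j] for j in level) for level in LEVELS)
    return total_s / samples, total_upper / samples

lower = 3.0  # every leaf-to-root path has 3 tasks with mean time 1
s_hat, upper_hat = estimate()
# Exact upper bound here: (1 + 1/2 + 1/3 + 1/4) + (1 + 1/2) + 1 = 55/12, about 4.58
print(lower, round(s_hat, 2), round(upper_hat, 2))
```

In this tree every path contains exactly one task per level, so the upper bound holds sample by sample, while the lower bound holds only in expectation.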


2.3.6 Task Scheduling

There  are  many  scheduling  algorithms  in  the  literature.  However,  the  majority  of  them 
deal  with  the  single  processor  scheduling  problem.  In  this  section  we  look  at  some  algorithms 
that  discuss  the  scheduling  problem  in  the  multiprocessor  situation. 


Price [PRIC83] discussed a shortest-path algorithm which is used to solve the scheduling problem of assigning tasks to processors. First, a distributed algorithm is presented for finding the shortest path from one node to all other nodes in a directed acyclic graph (DAG), using as many processors as needed. At the i-th iteration of the algorithm, the length of the shortest path from the root to node j, $f_j^{(i)}$, is computed as the minimum of the distance obtained at the (i-1)-th iteration and the distances $f_k^{(i-1)} + d_{kj}$, where node k is a neighbor of node j and $d_{kj}$ is the cost of the edge from node k to node j:

$$f_j^{(i)} = \min\Big( f_j^{(i-1)},\; \min_{k} \big( f_k^{(i-1)} + d_{kj} \big) \Big)$$

This computation can be performed in parallel at each node. For an N-node DAG, this algorithm will find the shortest path to all nodes at the end of the iteration. To use this algorithm to solve the task assignment problem, we make the following changes. Suppose $e_{ij}$ is the cost of executing task i on processor j, and $c_{ik}$ is the cost of communication incurred if task i and task k reside on different processors. The desired assignment [PRIC81, PRIC83] is one which minimizes

$$C = \sum_{i=1}^{N}\sum_{j=1}^{P} e_{ij}\,x_{ij} + \sum_{i=1}^{N-1}\sum_{k=i+1}^{N} c_{ik} - \sum_{j=1}^{P}\sum_{i=1}^{N-1}\sum_{k=i+1}^{N} c_{ik}\,x_{ij}\,x_{kj}$$

where $x_{ij} = 1$ if task i is assigned to processor j and $x_{ij} = 0$ otherwise, $\sum_{j=1}^{P} x_{ij} = 1$ (i.e., task i is assigned to exactly one processor), and P is the number of processors.
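The objective C can be evaluated directly for any assignment. The sketch below assumes small hypothetical cost matrices `e` (e[i][j] is the cost of task i on processor j) and `c` (c[i][k] is charged only when tasks i and k are on different processors), and represents an assignment as a tuple `proc` with proc[i] = j in place of the 0/1 variables x_ij. The test `proc[i] != proc[k]` is equivalent to the last two terms of C. The exhaustive search is only a reference point for tiny N and P, not a method from the cited works.

```python
from itertools import product

def cost(proc, e, c):
    """C = execution costs + communication costs of task pairs on different processors."""
    n = len(proc)
    execution = sum(e[i][proc[i]] for i in range(n))
    communication = sum(
        c[i][k] for i in range(n) for k in range(i + 1, n) if proc[i] != proc[k]
    )
    return execution + communication

def best_assignment(e, c, num_procs):
    """Exhaustive search over all num_procs**N assignments (tiny instances only)."""
    n = len(e)
    return min(product(range(num_procs), repeat=n), key=lambda p: cost(p, e, c))

# Hypothetical instance: 3 tasks, 2 processors.
e = [[2, 4], [3, 1], [5, 2]]           # e[i][j]: cost of task i on processor j
c = [[0, 4, 1], [4, 0, 0], [1, 0, 0]]  # c[i][k]: cost if tasks i and k are split
best = best_assignment(e, c, 2)
print(best, cost(best, e, c))  # (1, 1, 1) 7
```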


If the process graph is tree-shaped, then a DAG is generated from it by creating N·P nodes, where node [i,j] represents the assignment of task i to processor j. In addition, there is an initial node [0] and a terminating node [t]. Edges are generated by the following rules:

• edges from [0] to [r,j] are labeled with $e_{rj}$, where r corresponds to the root task in the process graph;

• edges from [i,j] to [k,j], where task k immediately follows task i in the process graph, are labeled $e_{kj}$;

• edges from [i,j] to [k,l], with k as above and l ≠ j, are labeled $(e_{kl} + c_{ik})$;

• edges from [i,j] to [i,l] do not exist;

• edges from [i,j] to [t] are labeled with the value 0 for every task i that is a terminal task in the process graph.

A specific assignment of N tasks to the P processors consists of a path from node [0] to node [t], and the optimal assignment is the path that minimizes the objective function C.
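The layered DAG and the parallel shortest-path iteration can be combined in a short sketch. For simplicity it assumes the process graph is a chain of tasks 0 -> 1 -> ... -> N-1 (the simplest tree, with task 0 as root and task N-1 as the only terminal task), so that each path from [0] to [t] assigns every task exactly once; the cost data are hypothetical. Here d_kj denotes the label of the edge from node k to node j. Each round updates every node from the previous round's values only, which mirrors the per-node parallel computation, and the loop simply runs until no value changes.

```python
import math

def build_dag(e, c_next):
    """Layered DAG for a chain of tasks 0 -> 1 -> ... -> N-1 on P processors.

    e[i][j]: cost of task i on processor j; c_next[i]: communication cost
    between tasks i and i+1 when they are on different processors.
    Returns {node: [(successor, edge_cost), ...]} with nodes "0", (i, j), "t".
    """
    n, p = len(e), len(e[0])
    edges = {"0": [((0, j), e[0][j]) for j in range(p)]}
    for i in range(n - 1):
        for j in range(p):
            edges[(i, j)] = [
                ((i + 1, l), e[i + 1][l] + (0 if l == j else c_next[i]))
                for l in range(p)
            ]
    for j in range(p):
        edges[(n - 1, j)] = [("t", 0)]  # terminal task: cost-0 edge to [t]
    return edges

def synchronous_shortest_paths(edges, source="0"):
    """f_j = min(f_j, min_k f_k + d_kj), every node updated from last round's values."""
    nodes = set(edges) | {v for outs in edges.values() for v, _ in outs}
    f = {v: math.inf for v in nodes}
    f[source] = 0
    pred = {}
    while True:
        new_f = dict(f)
        for u, outs in edges.items():
            for v, d in outs:
                if f[u] + d < new_f[v]:
                    new_f[v] = f[u] + d
                    pred[v] = u
        if new_f == f:
            return f, pred
        f = new_f

def best_chain_assignment(e, c_next):
    f, pred = synchronous_shortest_paths(build_dag(e, c_next))
    node, procs = pred["t"], []
    while node != "0":
        procs.append(node[1])  # node is (task, processor)
        node = pred[node]
    return f["t"], procs[::-1]

e = [[1, 5], [4, 1], [6, 1]]  # hypothetical execution costs: 3 tasks, 2 processors
c_next = [2, 2]               # hypothetical cost of splitting (0, 1) and (1, 2)
print(best_chain_assignment(e, c_next))  # (5, [0, 1, 1])
```

For a genuinely branching tree a single path no longer covers every task, so this sketch should be read only as an illustration of the labeling rules on the chain case.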

Stone [STON77] uses the Network Flow Algorithm to find the optimal assignment of tasks to two processors. The N tasks are connected by edges whose weights represent the cost of inter-task references (the communication costs) incurred when the two tasks are assigned to different processors. Next, two nodes, $S_1$ and $S_2$, which represent processors $P_1$ and $P_2$, are connected to each task. The weight of the edge between $S_1$ and a task is the cost of executing that task on $P_2$, and the weight of the edge between $S_2$ and a task is the cost of executing it on $P_1$, so that the edge which is cut when a task is assigned to a processor carries the cost of executing the task on that processor.

Treating all the edge weights as capacities in a flow network, Stone finds the maximum flow from $S_1$ to $S_2$. The corresponding minimum cut divides the tasks into two sets. The tasks in the same set as $S_1$ are assigned to processor $P_1$, and the other tasks are assigned to processor $P_2$.

Although this method does provide the optimal solution, it is not easy to generalize to cases with more than two processors. When P is very large, this method is indeed very difficult to apply.
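Stone's two-processor construction can be sketched with a textbook maximum-flow routine; the sketch uses Edmonds-Karp (breadth-first augmenting paths), which may differ from the flow algorithm used originally. It assumes the form of the construction in which the edge from S1 to a task carries that task's cost on P2 and the edge from the task to S2 carries its cost on P1; all cost figures are hypothetical. After the maximum flow is found, the tasks still reachable from S1 in the residual network form the S1 side of a minimum cut and are assigned to P1.

```python
from collections import deque

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp on a dict-of-dicts capacity map; returns (flow, source side of a min cut)."""
    res = {u: dict(vs) for u, vs in cap.items()}  # residual capacities
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0)  # reverse edges start at 0
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, r in res[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)  # nodes reachable from s in the residual graph
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[u][v] for u, v in path)  # bottleneck capacity
        for u, v in path:
            res[u][v] -= b
            res[v][u] += b
        flow += b

def stone_assignment(cost_p1, cost_p2, comm):
    cap = {"S1": {}, "S2": {}}
    for task in cost_p1:
        cap["S1"][task] = cost_p2[task]                 # cut when the task goes to P2
        cap.setdefault(task, {})["S2"] = cost_p1[task]  # cut when the task stays on P1
    for (a, b), w in comm.items():  # communication edges are undirected
        cap.setdefault(a, {})[b] = w
        cap.setdefault(b, {})[a] = w
    total, side = max_flow_min_cut(cap, "S1", "S2")
    on_p1 = sorted(task for task in cost_p1 if task in side)
    on_p2 = sorted(task for task in cost_p1 if task not in side)
    return total, on_p1, on_p2

result = stone_assignment(
    cost_p1={"A": 5, "B": 2, "C": 4},
    cost_p2={"A": 1, "B": 6, "C": 4},
    comm={("A", "B"): 3, ("B", "C"): 1},
)
print(result)  # (10, ['B', 'C'], ['A'])
```

The total cut weight, 10, equals the execution cost of B and C on P1 and A on P2 plus the one split communication edge between A and B.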


Van Tilborg [VANT81] discussed the Wave Scheduling technique. The processors are organized into a tree-shaped hierarchical structure with 'worker' processors at the leaves and 'manager' processors at the higher levels of the control tree.

Assume that a job requiring S processors enters any processor (either worker or manager). If this is a worker processor, it passes the request to its manager. The manager at this level tries to assign the S tasks to idle workers under its control. If it cannot schedule a job of size S, the job is passed up the control tree one level at a time until a manager finds at least S idle workers under its control and assigns them to the tasks. Because of communication delay, a manager might not have up-to-date information regarding the busy status of all the workers under its control; therefore, the manager always tries to assign the S tasks to P processors, where P is slightly larger than S. The difficulty is then to estimate the value of P.
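The escalation rule of wave scheduling can be sketched on a hypothetical two-level control tree. The sketch assumes that every manager knows the exact idle or busy status of the workers below it, so it deliberately ignores the stale-information problem described above (and hence the need to request slightly more than S workers); the class `Node` and the function `schedule` are illustrative names only.

```python
class Node:
    """A processor in the control tree: a worker (no children) or a manager."""

    def __init__(self, name, children=()):
        self.name, self.children, self.parent = name, list(children), None
        self.busy = False
        for child in self.children:
            child.parent = self

    def idle_workers(self):
        if not self.children:
            return [] if self.busy else [self]
        return [w for c in self.children for w in c.idle_workers()]

def schedule(entry, s):
    """Pass a request for s workers up from `entry` until some manager can serve it."""
    node = entry if entry.children else entry.parent  # a worker forwards to its manager
    while node is not None:
        idle = node.idle_workers()
        if len(idle) >= s:
            for w in idle[:s]:
                w.busy = True
            return node.name, [w.name for w in idle[:s]]
        node = node.parent  # not enough idle workers here: move one level up
    return None  # not enough idle workers anywhere in the tree

workers = [Node(f"w{i}") for i in range(4)]
m1, m2 = Node("m1", workers[:2]), Node("m2", workers[2:])
root = Node("root", [m1, m2])

first = schedule(workers[0], 2)
second = schedule(workers[1], 2)
third = schedule(workers[3], 1)
print(first)   # ('m1', ['w0', 'w1'])
print(second)  # ('root', ['w2', 'w3'])
print(third)   # None: every worker is now busy
```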

Lee [LEE77] studied the problem of optimally assigning tasks to processors by minimizing a cost function, which is the sum of two parts:

1. the processing cost of a task on the processor assigned;

2. the communication cost, which is the product of the volume of data to be transferred and the distance between the two processors, measured by the number of hops or by the physical distance.

He then discussed several assignment algorithms that minimize the above cost function for tree-shaped process graphs and more general process graphs.
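For a concrete reading of this two-part cost, the sketch below evaluates it for a given assignment. It assumes processors placed on a hypothetical 2 x 2 mesh with the hop count taken as the Manhattan distance between grid positions; the processing costs and data volumes are made-up illustrative values, and the function `lee_cost` is not taken from [LEE77].

```python
def lee_cost(assign, proc_cost, volume, position):
    """Processing cost plus, over communicating task pairs, data volume x hop distance."""
    def hops(p, q):
        # Hypothetical mesh: hop count = Manhattan distance between grid positions.
        (x1, y1), (x2, y2) = position[p], position[q]
        return abs(x1 - x2) + abs(y1 - y2)

    processing = sum(proc_cost[task][assign[task]] for task in assign)
    communication = sum(v * hops(assign[a], assign[b]) for (a, b), v in volume.items())
    return processing + communication

position = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}  # 2 x 2 mesh
proc_cost = {"a": [2, 2, 2, 2], "b": [3, 3, 1, 3], "c": [1, 4, 4, 4]}
volume = {("a", "b"): 5, ("a", "c"): 2}  # data sent between task pairs

all_on_0 = {"a": 0, "b": 0, "c": 0}
b_moved = {"a": 0, "b": 2, "c": 0}
print(lee_cost(all_on_0, proc_cost, volume, position))  # 6: no communication cost
print(lee_cost(b_moved, proc_cost, volume, position))   # 9: cheaper b, but 5 units x 1 hop
```

Moving task b to the processor where it runs fastest saves 2 units of processing but costs 5 units of communication, which is the trade-off the assignment algorithms have to resolve.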

Except for the algorithms discussed in [LEE77], none of this previous work included precedence relationships among the set of tasks. In Chapters 4 and 5, we do incorporate precedence relationships among the tasks into the scheduling algorithms.


2.4  Discussion 

In this chapter, we introduced some background information on why multi-processor systems have become more important. We looked at some results obtained from the analysis of the Graph Model of Behavior and some previous attempts to take advantage of the parallelism existing in programs. Some multi-processor hardware organizations and the scheduling problems on multi-processors were also studied. As summarized in Section 1.3, we extend these results to our model of computer jobs and find the speedup achievable in the multi-processor


CHAPTER  3 
A  General  Model 


We define our system to be a set of processors plus a queue with unlimited waiting room. In the case where there is a fixed number of jobs in the system, all jobs are initially present; otherwise, jobs arrive at the system by some random arrival process. Each job brings to the system a set of tasks represented by a process graph (described in Section 3.2), and each task requires processing by a resource (described in Section 3.1). A job departs our system after its final task has been completed. The system time of a job is defined as the interval from the time of arrival until the completion time of the last of its tasks.


3.1  Resources 

The resources studied here consist of a set of identical processors, connected via a local communication network, each capable of independent operation on a single task. In this dissertation, we concentrate on the problem of task assignment; we defer to the references [METC76, BUX81, KIES81, LELA82] questions regarding the types of connection networks most suitable for parallel processing communications, the amount of storage required for each processor in order for it to process the largest task assigned to it, the amount of communication bandwidth necessary to keep the communication time small in comparison to the processing time, and the overhead of this communication. Of course, we recognize that there will be higher communication delay if the tasks of the same job are assigned to processors far apart in the processor network. In Chapters 4 and 5, however, we assume that this communication cost is free (or, as an approximation, that the average delay is incorporated into the processing requirement of each task); in Chapter 6 we bring the communication cost into the model.

The  processors  are  identical  in  terms  of  their  capabilities  in  processing  speed  and 
storage.  Usually,  the  total  number  of  processors  is  assumed  to  be  a  fixed  constant,  P.  In  some 
cases  where  we  have  enough  processors  to  keep  all  executable  tasks  busy,  we  assume  P  is 
infinite. 


3.2 Process Graph


Each computer job is represented by a set of tasks, T, and a partial ordering of these tasks (given by a set of precedence relationships). We represent a task by a node and represent a precedence condition, where task i must be completed before task j, by a directed edge from i to j, denoted by (i, j). In the following, all directed edges point downward in the graphs; therefore, we will not put arrows on the edges. In the following discussion, we distinguish neither between nodes and tasks nor between edges and precedence conditions.

We use a directed acyclic graph to represent the tasks in a computer job. Each node is a task that requires processing, and an edge (i, j) prevents task j from starting until task i has been completed. Two tasks can be executed in parallel if and only if neither task is a predecessor of the other. The precedence relationship into a node is an 'and' type operator. Suppose $\{x_1, x_2, \ldots, x_n\}$ are the nodes having edges into node $x_0$. Then $x_0$ may start execution only after all of the $x_i$, for i = 1, 2, ..., n, have completed execution. Without loss of generality, we assume there is only one starting node and one terminating node for each job. If there are several nodes with in-degree zero, we can add a new node with in-degree zero which points to each of these nodes. Similarly, for several nodes with out-degree zero, we can add a new node with out-degree zero and then create new edges from all these nodes to this new node. The resulting directed acyclic graph is called a process graph. Figure 3.1 gives an example of a process graph.
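The normalization to a single starting node and a single terminating node can be written as a small graph transformation. In this sketch the graph is a dict mapping each task to its list of successors, and the names `START` and `END` for the added nodes are hypothetical; as in the text, a new node is added only when there is more than one node of in-degree zero (or out-degree zero).

```python
def normalize(succ, start="START", end="END"):
    """Return a copy of the DAG `succ` with a single source and a single sink."""
    g = {u: list(vs) for u, vs in succ.items()}
    for vs in succ.values():
        for v in vs:
            g.setdefault(v, [])  # make sure every node appears as a key
    has_pred = {v for vs in g.values() for v in vs}
    sources = [u for u in g if u not in has_pred]  # in-degree zero
    sinks = [u for u in g if not g[u]]             # out-degree zero
    if len(sources) > 1:
        g[start] = sources  # new node of in-degree zero pointing to each old source
    if len(sinks) > 1:
        for u in sinks:
            g[u].append(end)  # new edges from every old sink to the new terminating node
        g[end] = []
    return g

g = normalize({"a": ["c"], "b": ["c", "d"]})
print(g)
```

Because the added nodes are only connected to the old sources and sinks, no new precedence is imposed between existing tasks, which is why the assumption loses no generality.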

Two parameters characterizing a process graph are its length and width. The length of a process graph, sometimes referred to as the total number of levels, is the number of tasks in the longest path between the starting and terminating nodes. We place a node j at level i if

$$i = \max_{u \in U} b_u$$

where $b_u$ is the number of tasks in path u from the initial node to node j, and U is the set of all paths from the initial node to node j. The width of a process graph is equal to

$$\max_{i} \,[\text{number of tasks in level } i].$$

In Figure 3.1, there are 5 levels and a width of 3.
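The level and width definitions translate directly into a longest-path computation over a topological order. In this sketch the process graph is a dict mapping each task to its successors (an assumed representation), and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) supplies an order in which every task appears after all of its predecessors. A task's level is one more than the largest level among its predecessors, the length is the largest level, and the width is the largest number of tasks on one level. The example graph is made up and is not Figure 3.1.

```python
from collections import Counter
from graphlib import TopologicalSorter  # Python 3.9+

def levels(succ):
    """Level of each node = number of tasks on the longest path from the initial node."""
    preds = {u: [] for u in succ}
    for u, vs in succ.items():
        for v in vs:
            preds.setdefault(v, []).append(u)
    level = {}
    for v in TopologicalSorter(preds).static_order():  # predecessors come first
        level[v] = 1 + max((level[u] for u in preds[v]), default=0)
    return level

def length_and_width(succ):
    level = levels(succ)
    per_level = Counter(level.values())  # number of tasks on each level
    return max(level.values()), max(per_level.values())

g = {1: [2, 3, 4], 2: [5], 3: [5], 4: [6], 5: [6], 6: []}
print(dict(sorted(levels(g).items())))  # {1: 1, 2: 2, 3: 2, 4: 2, 5: 3, 6: 4}
print(length_and_width(g))              # (4, 3)
```

Note that task 4 sits on level 2 even though it could also run later; the level records the earliest step at which a task can start when P is infinite and all service times are equal.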


Process graphs have a hierarchical structure, so that each task in a process graph could, by itself, represent another process graph. This property can be useful in describing an operating system or a computer program. In an operating system, the nodes in a process graph could represent the jobs to be run. Within each job (or node), there is another structure of precedences which represents the execution order of the tasks. A similar situation exists in a program environment. Subroutines can be represented by nodes in the process graph, and within each subroutine another process graph could exist, with each node representing an executable block of statements. This hierarchical structure provides a useful tool for studying the scheduling problem at several different levels of complexity. When a more detailed schedule is required, each node of the process graph can be expanded into a finer process graph so that more detailed tasks can be individually scheduled. The opposite is also true; we can schedule the nodes, each of which represents a group of tasks in the original process graph.

Examples  of  process  graphs  representing  some  actual  jobs  include: 

1. $N \times N$ matrix inversion

After the initialization task, we can concurrently calculate the determinant and the cofactors. But each of the cofactors (say, the one for row $i$ and column $j$) is, in turn, computed from the $(N-1)\times(N-1)$ submatrix obtained by eliminating the $i$-th row and $j$-th column of the original matrix. Therefore, each task may expand into more subtasks. This recursion stops when only a $2\times 2$ matrix remains in each of the subtasks. Figure 3.2 gives a typical process graph for the matrix inversion problem.
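The recursive structure of this example can be sketched in Python. The sketch below inverts a matrix through its cofactors (the transposed cofactor matrix divided by the determinant). It illustrates that approach and is not the dissertation's code. It runs sequentially, and the comments mark the steps that are independent tasks in the process graph: the determinant and the $N^2$ cofactors, each of which recurses on smaller submatrices until a $2\times 2$ determinant remains.

```python
from fractions import Fraction

def minor(m, i, j):
    """The submatrix of m with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]

def det(m):
    """Determinant by cofactor expansion along the first row."""
    n = len(m)
    if n == 1:
        return m[0][0]
    if n == 2:                      # the recursion stops at a 2x2 matrix
        return m[0][0] * m[1][1] - m[0][1] * m[1][0]
    # each term is a subtask on an (n-1)x(n-1) submatrix
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(n))

def cofactor(m, i, j):
    return (-1) ** (i + j) * det(minor(m, i, j))

def inverse(m):
    """Inverse of a square matrix of integers or Fractions via cofactors."""
    n = len(m)
    d = det(m)                      # independent task
    if d == 0:
        raise ValueError("matrix is singular")
    if n == 1:
        return [[Fraction(1) / d]]
    # the n*n cofactors are mutually independent tasks
    cof = [[cofactor(m, i, j) for j in range(n)] for i in range(n)]
    # inverse = transpose of the cofactor matrix divided by the determinant
    return [[Fraction(cof[j][i]) / d for j in range(n)] for i in range(n)]

print(inverse([[1, 2, 3], [0, 1, 4], [5, 6, 0]]))
```

Exact `Fraction` arithmetic keeps the result comparable entry by entry; for the matrix above the entries equal `[[-24, 18, 5], [20, -15, -4], [-5, 4, 1]]`.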

2. Shell Sort

Shell Sort [KNUT73] first sorts every $h_t$-th number in order. It then sorts every $h_{t-1}$-th, $h_{t-2}$-th, ..., $h_1$-th number in order at each successive iteration. The numbers $h_t, h_{t-1}, \ldots, h_1$ are integers with $h_i > h_{i-1}$; so, the number of parallel sorts depends on how many numbers are to be sorted and on the values of $h_i$, $i = 1, 2, \ldots, t$. Figure 3.3 shows a typical process graph for the shell sort.
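The parallelism in Shell sort can be made explicit. In the Python sketch below, each pass with increment $h$ splits the list into $h$ interleaved subsequences that touch disjoint positions, so their insertion sorts are independent tasks on one level of the process graph. The increment sequence is supplied by the caller (assumed to be decreasing and to end in 1, which guarantees a sorted result); the sorts run sequentially here.

```python
def insertion_sort_chain(a, start, h):
    """Insertion-sort the subsequence a[start], a[start + h], ... in place."""
    for i in range(start + h, len(a), h):
        value = a[i]
        j = i
        while j - h >= start and a[j - h] > value:
            a[j] = a[j - h]           # shift larger elements h positions right
            j -= h
        a[j] = value

def shell_sort(numbers, increments):
    """Shell sort with increments h_t > h_{t-1} > ... > h_1 (h_1 = 1 assumed)."""
    a = list(numbers)
    for h in increments:
        # The h chains touch disjoint positions, so these sorts are
        # independent tasks; here they simply run one after another.
        for start in range(min(h, len(a))):
            insertion_sort_chain(a, start, h)
    return a

print(shell_sort([9, 4, 7, 1, 8, 2, 6, 3, 5], (4, 2, 1)))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```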

3.  Quicksort 

Quicksort [KNUT73] uses the first number (the key number) in the list to divide the list into two parts. The left list contains all the numbers smaller than the key number; the right list contains all the numbers greater than the key number. These two lists can then be sorted independently by repeating the above procedure (i.e., using the first number of each list as the key number and splitting each list into two more lists). This subdivision continues until at most one number remains in each subdivided list. Hence, a typical process graph might look like Figure 3.4. Because this sorting procedure depends on the value of the first number in the list, however, some unusual process graphs such as those in Figure 3.5 can result.
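A quicksort that records its own process graph shows why the shape depends on the data. In this hypothetical Python sketch, every partitioning of a list with two or more numbers is one task, and an edge joins a task to the tasks that sort its two sublists. Numbers equal to the key are sent to the right list, a choice the text does not specify.

```python
from itertools import count

def quicksort_graph(numbers):
    """Quicksort that also returns the edges (parent_task, child_task) of its
    process graph; every partitioning of a list with 2+ numbers is one task."""
    task_ids = count()
    edges = []

    def task(lst, parent):
        if len(lst) <= 1:                 # one or no number left: nothing to split
            return list(lst)
        tid = next(task_ids)
        if parent is not None:
            edges.append((parent, tid))
        key, rest = lst[0], lst[1:]       # the first number is the key
        left = [v for v in rest if v < key]
        right = [v for v in rest if v >= key]   # ties go right (an assumption)
        # the two sublists are independent subtasks (run sequentially here)
        return task(left, tid) + [key] + task(right, tid)

    return task(list(numbers), None), edges

print(quicksort_graph([4, 2, 6, 1, 3, 5, 7]))  # ([1, 2, 3, 4, 5, 6, 7], [(0, 1), (0, 2)])
print(quicksort_graph([1, 2, 3, 4, 5]))        # ([1, 2, 3, 4, 5], [(0, 1), (1, 2), (2, 3)])
```

The first input gives a small balanced tree, while an already sorted list degenerates into a chain with no parallelism at all.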

In later chapters, we will consider two cases of process graphs. In one case, the structure of the process graph for each job is fixed and known in advance. In the second case, each graph has a random structure; so, each job may have a different process graph.


3.3  Taxonomy 


We  have  divided  the  task  assignment  and  scheduling  problems  into  sixteen  cases.  The 
parameters  used  for  the  classifications  are: 

1.  Number  of  Jobs 

We can have a fixed number of jobs, $k$, at the start (time $t=0$), or we can allow jobs to arrive from a Poisson source with an average arrival rate of $\lambda$ jobs/sec.

2.  Types  of  Process  Graph 

Each job in our problem can have an identical process graph, $G$, or each job can have a random process graph, $G^*$.

3.  Processing  Requirement  of  each  Task 

Each task may have either a constant or a random processing requirement. In the latter case, the random task time for a fixed process graph can be sampled once at the beginning and used by all jobs, or it can be sampled each time the task is processed. If the random sampling is done only once, this case can be reduced to the constant processing requirement case by transforming each task into a chain of tasks, the length of each chain being equal to the task's service time requirement and each task in the chain having one normalized time unit of service demand.

4.  Number  of  Processors 

The  number  of  processors  is  given  by  P.  If  there  are  enough  processors  so  that  each  task 
can  be  processed  whenever  needed,  then  P  can  be  treated  as  infinite. 
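The reduction of once-sampled random task times to the constant-time case in item 3 can be sketched as a graph transformation. The Python sketch below assumes each sampled requirement has already been rounded to a positive whole number of normalized time units. The function name and the `(task, k)` naming of chain links are illustrative.

```python
def expand_to_unit_chains(succ, service):
    """Replace each task t by the chain (t, 1) -> (t, 2) -> ... -> (t, service[t]).

    succ:    task -> list of successor tasks
    service: task -> positive integer number of normalized time units
    """
    expanded = {}
    for t, successors in succ.items():
        n = service[t]
        for k in range(1, n):
            expanded[(t, k)] = [(t, k + 1)]        # internal links of the chain
        # the last link inherits the task's original outgoing precedences
        expanded[(t, n)] = [(s, 1) for s in successors]
    return expanded

print(expand_to_unit_chains({"a": ["b"], "b": []}, {"a": 2, "b": 1}))
# {('a', 1): [('a', 2)], ('a', 2): [('b', 1)], ('b', 1): []}
```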


With the terms defined above, we can summarize the sixteen cases with the taxonomy tree in Figure 3.6.

At the first level, we distinguish whether there is a fixed number ($k$) of jobs at time $t=0$ or whether jobs keep on arriving (at rate $\lambda$) after time $t=0$. The second level deals with the type of process graph for each job: all jobs have either the same process graph ($G$) or random process graphs ($G^*$). The next level separates jobs with constant task time ($x$) service demands from jobs with random task time ($x^*$) service demands. A fourth level separates the situations in which the number of processors, $P$, is at least the maximum number of concurrent tasks that demand processors (treated as $P=\infty$) from those in which it is not ($P<\infty$).
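Since each of the four parameters takes one of two values, the taxonomy is a $2\times 2\times 2\times 2$ product. A short Python sketch that enumerates the sixteen leaves (the ASCII labels `lambda`, `P=inf` and `P<inf` stand in for the symbols used in the text):

```python
from itertools import product

JOBS = ("k", "lambda")        # fixed number of jobs at t = 0, or Poisson arrivals
GRAPHS = ("G", "G*")          # identical or random process graph
TASK_TIMES = ("x", "x*")      # constant or random task time
PROCESSORS = ("P=inf", "P<inf")

cases = list(product(JOBS, GRAPHS, TASK_TIMES, PROCESSORS))
for number, case in enumerate(cases, start=1):
    print(number, ", ".join(case))
print(len(cases))   # 16
```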


For all these cases we are interested in a parameter we call the concurrency measure, $\sigma$, which is defined to be the average system time required using $P$ processors, $S(P)$, divided by the average system time required using one processor, $S(1)$; that is,

$$\sigma = \frac{S(P)}{S(1)}.$$

$1/\sigma$ measures just how much parallel processing is possible for a particular job; that is, $1/\sigma$ is the "speedup" factor. Note further that $1/(\sigma P)$ is the efficiency of each processor. In our notation, when $P=\infty$, we really mean that we have a large number of processors, say $P^*$ (i.e., the maximum width of a process graph multiplied by the number of jobs), instead of an infinite number of processors; thus, the efficiency $1/(\sigma P)$ is not to be interpreted as zero for "$P=\infty$"; rather, the efficiency is

$$\frac{1}{\sigma P^*}.$$


A large speedup may make an architecture appear good, but the efficiency of the processors is also important. It is easy to get an efficiency of 1 (for example, by using a single processor), but such a system will be very slow. This tradeoff is studied when we discuss the issue of power (in Section 5.5). Note also that the maximum speedup is

$$\frac{1}{\sigma} = \frac{S(1)}{S(P)} \le P.$$
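As a small numerical illustration of these definitions (the numbers are made up for the example): if a job takes $S(1) = 12$ time units on one processor and $S(P) = 4$ on $P = 4$ processors, then $\sigma = 1/3$, the speedup is 3, and each processor is 75% efficient. A minimal Python helper:

```python
def concurrency(s1, sp, p):
    """Return (sigma, speedup, efficiency) from S(1), S(P) and P."""
    sigma = sp / s1            # concurrency measure S(P)/S(1)
    speedup = s1 / sp          # 1/sigma, never more than p
    efficiency = speedup / p   # 1/(sigma P)
    return sigma, speedup, efficiency

print(concurrency(12.0, 4.0, 4))   # (0.3333333333333333, 3.0, 0.75)
```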


Some  of  the  other  parameters  not  considered  are 
-  task  interruptibility  (preemption) 


-  homogeneous  versus  heterogeneous  processors 


Preemption is not considered since the communication overhead required for each preemption may be too large in a distributed processing environment. For simplicity, we have assumed homogeneous processors. If heterogeneous processors were used, every assignment would have to take into account the speed of each processor with respect to each job.

3.4  Notation 

Following  is  a  partial  list  of  notations  used  throughout  the  rest  of  this  dissertation. 
Additional  notation  will  be  introduced  as  used. 

$k$  number of jobs present at the beginning (time $t=0$)

$\lambda$  arrival rate of jobs from a Poisson source

$G$  fixed process graph

$G^*$  random process graph

$x$  constant processing time of a task

$x^*$  random processing requirement of a task

$P$  number of processors in the system

$r$, $L$  number of levels in a process graph

$w$  width of a process graph

$N$  total number of tasks in a process graph

$S$  average system time required to complete a job

$\sigma$  concurrency measure


3.5  Cases  Studied 


Many  performance  objectives  are  available: 

-  minimize  the  completion  time  of  the  slowest  job; 

- minimize the number of processors required;

-  minimize  the  average  system  time; 

-  minimize  the  processor  idling  time  or  maximize  the  processor  utilization. 

These objectives can be used in combination or by themselves. In our work, we have chosen to use the minimization of the average system time (referred to in some literature as the flow time) as the performance objective.

Of the sixteen cases shown in Figure 3.6, eight have a limited number of processors ($P < \infty$). Therefore, these cases require scheduling of the tasks in each job. We defer most of the scheduling problem to the references cited in Chapter 2 and concentrate on the cases where we can assume that enough processors exist for all jobs and tasks that demand them.


Two of the cases, $(k, G, x, P=\infty)$ and $(\lambda, G, x, P=\infty)$, are very simple to analyse. All the parameters are deterministic; therefore, all the measures can be easily calculated. For both systems, we find the average system time, $S$ (in fact, a constant for all jobs), by multiplying the number of levels, $r$, by the task processing time: $S = rx$. Since a single processor needs $S(1) = Nx$ to process all $N$ tasks, $\sigma = r/N$. For the arrival system, by Little's Result [LITT61], the average number of jobs in the system is $\bar{k} = \lambda r x$. Since we have an M/D/∞ system, we also know the distribution of the number of jobs in the system, $P_k$, to be


$$P_k = \frac{(\lambda r x)^k}{k!}\, e^{-\lambda r x}.$$

In all cases in which it can be assumed that $P=\infty$ and $x$ is constant, the precedence relationships given in $G$ are of no consequence except to ascertain the number of levels. For the constant service time cases, one level of the process graph is completed for every $x$ units of service time, regardless of whether a node on the previous level has precedence over this node. The assignment problem is also simplified by assigning all processors required to all tasks on the same level of the process graph for $x$ units of process time. Because we have enough processors to keep any number of jobs active concurrently, the number of jobs is irrelevant. Any job, whether one of the $k$ jobs present at time $t=0$ or one arriving from a Poisson source, will spend $rx$ units of time in the system before departing. Therefore, we need to study just one job in order to find all the system parameters.


In Chapter 4 we study the cases i) $(k, G, x^*, P=\infty)$, ii) $(k, G, x, P<\infty)$, iii) $(k, G, x^*, P<\infty)$, and iv) $(\lambda, G, x^*, P=\infty)$. In Chapter 5 we concentrate on random process graphs for cases i) $(k, G^*, x, P=\infty)$, ii) $(k, G^*, x^*, P=\infty)$, and iii) $(k, G^*, x, P<\infty)$. In Chapter 6 we study the communication overhead with the case $(k, G, x^*, P<\infty)$.

In the taxonomy tree of Figure 3.6, the section associated with a particular case is shown at the bottom. The other cases are left for future research.


CHAPTER  4 
Fixed Process Graphs


4.1  Introduction 

In this chapter, we explore cases where the process graph is fixed (i.e., given). The service time for each task is random in Sections 4.2, 4.3, and 4.4, while in Section 4.5 we assume it is constant. Unless stated otherwise, the number of processors is assumed to be infinite; so the results obtained are independent of the number of jobs. In Section 4.2.1 we first obtain the average system time for the case of exponentially distributed service time for each task. The process graph is first converted into a Markov chain; the equilibrium probabilities of the states in the chain are then obtained. From the average system time we find the concurrency measure for a specific process graph. We use bounds on the average system time to approximate the concurrency measure in those cases where the exact concurrency measure is difficult to compute. Section 4.2.2 describes how the bounds are obtained. In Section 4.3 we consider the arrivals of jobs to the system instead of a fixed number of jobs. In Section 4.4 we consider a finite number of processors, and, using Stochastic Petri Net theory and the notion of power, we find the optimum number of processors that should be assigned to each job. Section 4.5 deals with the assignment of tasks to processors. We look at the ratio of the average system time when the best scheduling algorithm is used to that when the worst scheduling algorithm is used.

With  either  the  exact  concurrency  measure  or  bounds  on  the  concurrency  measure,  we 
have  characterized  a  process  graph.  The  average  execution  time,  the  average  width  and  the 
speed-up  that  is  possible  for  this  process  graph  can  all  be  derived  from  the  concurrency  measure. 


4.2 Fixed Number of Jobs $(k, G, x^*, P=\infty)$
4.2.1  The  Exact  Average  System  Time 
In order to find the average system time of a process graph, we must be able to compute the average concurrency of the tasks. Towsley [TOWS78] introduced a model of parallel processing for CPU and I/O overlapping. In this model, after a CPU task terminates (say task CPU$_1$), it initiates another CPU task along with an I/O task (i.e., CPU$_2$ and I/O); see Figure 4.1a.


Figure 4.1a CPU and I/O Overlap


If we now represent this system behavior by a Markov state transition diagram (Figure 4.1b), a job may be in any of four states:

1) CPU$_1$: the CPU is executing task CPU$_1$,
2) CPU$_2$: the CPU is executing task CPU$_2$ alone,
3) I/O: the I/O task is executing alone,
4) CPU$_2$-I/O: the CPU is executing task CPU$_2$ in parallel with the execution of the I/O task.

The time spent in each state is selected from an exponential distribution, with the mean service times for CPU$_1$, CPU$_2$, and I/O equal to $1/\mu_1$, $1/\mu_2$, and $1/\mu_3$, respectively.
Figure 4.1b Markovian State Transition Diagram
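The mean duration of one pass through this four-state chain follows from the state holding times and branching probabilities. The Python sketch below assumes, as a reading of Figure 4.1a, that a pass starts in state CPU$_1$, moves to CPU$_2$-I/O, and ends when both CPU$_2$ and the I/O task have finished. The rate names `mu1`, `mu2`, `mu3` (for CPU$_1$, CPU$_2$ and I/O) are hypothetical. The result agrees with the closed form $1/\mu_1 + 1/\mu_2 + 1/\mu_3 - 1/(\mu_2+\mu_3)$: the expected CPU$_1$ time plus the expected maximum of the two overlapped exponential times.

```python
from fractions import Fraction

def towsley_pass_time(mu1, mu2, mu3):
    """Mean time from entering CPU1 until both CPU2 and I/O have finished."""
    t_cpu1 = 1 / mu1                      # state 1) CPU1
    t_overlap = 1 / (mu2 + mu3)           # state 4) CPU2-I/O, left at rate mu2 + mu3
    p_cpu2_first = mu2 / (mu2 + mu3)      # CPU2 finishes first -> state 3) I/O alone
    # otherwise the I/O task finishes first -> state 2) CPU2 alone
    t_remaining = p_cpu2_first / mu3 + (1 - p_cpu2_first) / mu2
    return t_cpu1 + t_overlap + t_remaining

mu1, mu2, mu3 = Fraction(1), Fraction(2), Fraction(3)   # example rates
print(towsley_pass_time(mu1, mu2, mu3))                  # 49/30
print(1 / mu1 + 1 / mu2 + 1 / mu3 - 1 / (mu2 + mu3))     # 49/30 (closed form)
```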


In  this  section,  we  use  Towsley's  approach  in  our  concurrency  problem  by  converting 
process  graphs  into  Markovian  state  transition  diagrams.  Two  methods  of  obtaining  the  average 
system  time  are  then  discussed. 


4.2.1.1 Converting the Process Graph Into a Markovian State Transition Diagram

A Markovian state transition diagram, $M(G)$, is generated for process graph $G$, where each state in the Markov chain represents a specific set of tasks in $G$ that can be executed in parallel. Let $C_\alpha$ represent a state in the Markov state transition diagram, where $\alpha$ is the set of tasks that are executed concurrently. Also, let $|\alpha|$ represent the number of tasks in the set $\alpha$.

The chain starts with state $C_{\{i\}}$, where $i$ is the initial task in $G$. Each state $C_\alpha$ in the chain branches to $|\alpha|$ other states, each branch corresponding to the termination of one of the tasks in $\alpha$. The state $C_{\alpha'}$ at the end of one of these branches has the set of active tasks $\alpha'$, where $\alpha'$ consists of the tasks in $\alpha$ minus the completed task, plus any other tasks activated by the termination of this task. The exact procedure for obtaining the Markov chain from a process graph is given in Algorithm CPM (i.e., Convert Process graph to Markov chain) in Figure 4.2. Figures 4.3a and 4.3b are examples of this algorithm.


ALGORITHM CPM

1. For the initial task $i$, create an initial state with one active task $i$.
   Mark this state 'unlabeled.'

2. Select one of the unlabeled states, $C_\alpha$, and mark it 'labeled.'
   Suppose there are $z$ active tasks, $\phi_1, \phi_2, \ldots, \phi_z$, in this state $C_\alpha$.
   For each $\phi_j$, create a branch with the branch label $\phi_j$; this label corresponds to the termination of task $\phi_j$.
   If we traverse back from state $C_\alpha$ to the initial state, the tasks on the branches of this path form the set of completed tasks.
   By adding $\phi_j$ to this set, we can check the process graph for new tasks, if any, which become active; call this set $\beta_j$.
   The branch with label $\phi_j$ goes into state $C_{\alpha'}$, where $\alpha' = \alpha - \{\phi_j\} + \beta_j$.
   If $C_{\alpha'}$ does not exist, create this state and mark it 'unlabeled.'

3. If any state is not marked 'labeled,' go to step 2.

4. Create a branch from the terminating state to the initial state; stop.


Figure  4.2  Algorithm  CPM 
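Algorithm CPM translates almost line for line into code. The Python sketch below is one possible rendering, not the dissertation's implementation. It represents a state $C_\alpha$ by the frozenset $\alpha$ of active tasks. Instead of traversing back along a path to recover the completed tasks, it stores them with each state when the state is first created; for a given set of active tasks the completed set is the same on every path, so nothing is lost. The dictionary `succ` and the example edges (A precedes B and C, B precedes D and E, C precedes F, and D, E, F precede G) are assumptions chosen to be consistent with the description of Figure 4.3a.

```python
def cpm(succ):
    """Sketch of Algorithm CPM.

    succ maps every task to its list of successors (one initial and one
    terminating task). A state is the frozenset of its active tasks. Returns
    (initial_state, branches), where branches[state] lists (finished_task, next_state).
    """
    preds = {t: set() for t in succ}
    for t, successors in succ.items():
        for s in successors:
            preds[s].add(t)
    initial = frozenset(t for t in succ if not preds[t])   # step 1
    # completed tasks on the way into each state (the same for every path)
    completed = {initial: frozenset()}
    branches = {}
    unlabeled = [initial]
    while unlabeled:                       # step 3: loop until all states are labeled
        state = unlabeled.pop()            # step 2: label one state
        branches[state] = []
        for phi in sorted(state):          # one branch per active task
            done = completed[state] | {phi}
            # beta: successors of phi whose predecessors are now all complete
            beta = {s for s in succ[phi] if preds[s] <= done}
            nxt = (state - {phi}) | beta
            if not nxt:                    # step 4: terminating task -> initial state
                nxt = initial
            elif nxt not in completed:     # new state, still unlabeled
                completed[nxt] = done
                unlabeled.append(nxt)
            branches[state].append((phi, nxt))
    return initial, branches

succ = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
        "D": ["G"], "E": ["G"], "F": ["G"], "G": []}
initial, branches = cpm(succ)
for state, out in branches.items():
    print("".join(sorted(state)), "->", ["".join(sorted(n)) for _, n in out])
```

For this graph the sketch produces 16 states, which agrees with the range $2 \le i \le 16$ used with the balance equations of Figure 4.4.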


We find that there are as many levels in $M(G)$ as there are tasks in the process graph $G$. This also equals the number of states visited in $M(G)$ before a job cycles back to the first state $C_{\{i\}}$ in the Markovian state transition diagram.


Figure  4.3a  Process  Graph 


4.2.1.2 The Average System Time


For a state $C_\alpha$, the rate of leaving this state is $\sum_{k \in \alpha} \mu_k$, where $1/\mu_k$ is the mean of the exponential service time of task $k$. Therefore, the mean time spent in this state is

$$\frac{1}{\sum_{k \in \alpha} \mu_k}.$$

Let us now assume that $\mu_i = \mu$ for all tasks $i$. Therefore, the mean time a job stays in state $C_\alpha$ is $1/(|\alpha|\mu)$.

Due to the memoryless property of the exponential service time distribution, each task in $C_\alpha$ has the same mean service time regardless of whether a specific task had been processed in another state. Hence,

$$\text{Prob}[\,i \text{ completes first} \mid i \in \alpha\,] = \frac{1}{|\alpha|} \quad \text{for all } i \in \alpha.$$

Starting from the initial state, there are many paths a job can traverse before reaching the terminating state in a Markovian state transition diagram. Since we know the probability of traversing each branch, the probability that a specific path has been taken can be calculated. Suppose the path taken proceeds through the states $C_{\alpha_1}, C_{\alpha_2}, \ldots, C_{\alpha_N}$. The probability of taking this path is

$$\prod_{j=1}^{N} \frac{1}{|\alpha_j|}.$$

Suppose there are $r$ different paths from the initial state to the terminating state in the Markov state transition diagram. If it takes an average of $T_i$ units of time to complete path $i$, and $p_i$ is the probability that this path is chosen, then

$$S(P) = \sum_{i=1}^{r} T_i\, p_i. \qquad (4.1)$$


The number of paths from the initial state to the terminating state is finite, so all of them can be enumerated. By summing, over all paths, the product of the total average time spent in the states of a path and the probability of taking this path, we are then able to find the average system time of the process graph represented by this Markovian state transition diagram.


Take the example shown in Figure 4.3b. The average time for the path

$C_A, C_{BC}, C_{DEC}, C_{DC}, C_{DF}, C_F, C_G$,

including the return to $C_A$ (i.e., a cycle), is

$$\frac{1}{\mu} + \frac{1}{2\mu} + \frac{1}{3\mu} + \frac{1}{2\mu} + \frac{1}{2\mu} + \frac{1}{\mu} + \frac{1}{\mu} = \frac{29}{6\mu},$$

and the probability of taking this path is $\frac{1}{2}\cdot\frac{1}{3}\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{24}$. Summing the product of average path time and path probability over all possible paths, we obtain an average system time of $\approx 5.0556$ (in units of the mean task time $1/\mu$).

From a simulation of this system, we obtain a value of $\approx 5.09$ for the average system time, a result which is in very close agreement with the predicted value.
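Equation (4.1) can be evaluated by exhaustively walking the chain. The Python sketch below repeats a compact version of the CPM conversion so that it runs on its own. It assumes equal rates $\mu_i = \mu$ and uses exact fractions: each state contributes a mean holding time $1/(|\alpha|\mu)$, and each branch is taken with probability $1/|\alpha|$. The edges of the example graph are an assumption (A precedes B and C, B precedes D and E, C precedes F, and D, E, F precede G); they are consistent with the description of Figure 4.3a and, with $\mu = 1$, reproduce the value $\approx 5.0556$ quoted above. Because every path is listed explicitly, the approach only suits small graphs.

```python
from fractions import Fraction

def cpm(succ):
    """Compact Algorithm CPM: a state is the frozenset of its active tasks."""
    preds = {t: set() for t in succ}
    for t, ss in succ.items():
        for s in ss:
            preds[s].add(t)
    initial = frozenset(t for t in succ if not preds[t])
    completed, branches, unlabeled = {initial: frozenset()}, {}, [initial]
    while unlabeled:
        state = unlabeled.pop()
        branches[state] = []
        for phi in sorted(state):
            done = completed[state] | {phi}
            nxt = (state - {phi}) | {s for s in succ[phi] if preds[s] <= done}
            if not nxt:                      # terminating task: back to the start
                nxt = initial
            elif nxt not in completed:
                completed[nxt] = done
                unlabeled.append(nxt)
            branches[state].append((phi, nxt))
    return initial, branches

def mean_system_time_by_paths(succ, mu=1):
    """Equation (4.1), S(P) = sum_i T_i p_i, with every task rate equal to mu."""
    initial, branches = cpm(succ)
    total = Fraction(0)
    stack = [(initial, Fraction(1), Fraction(0))]   # (state, path prob., path time)
    while stack:
        state, p, t = stack.pop()
        t += Fraction(1, len(state)) / mu            # mean holding time 1/(|alpha| mu)
        if branches[state][0][1] == initial:         # terminating state: path done
            total += p * t
        else:
            for _, nxt in branches[state]:           # each active task finishes first
                stack.append((nxt, p / len(state), t))   # with probability 1/|alpha|
    return total

succ = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
        "D": ["G"], "E": ["G"], "F": ["G"], "G": []}
S = mean_system_time_by_paths(succ)
print(S, round(float(S), 4))   # 91/18 5.0556
```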


From the above calculation, we see that the average system time is greater than $4/\mu$, which is the number of levels in the process graph (Figure 4.3a) multiplied by the average service time of a task. The reason for this difference is that task G must wait for the completion of its 3 predecessor tasks (tasks D, E, and F) before it may start. Thus the time to process nodes D, E, and F in parallel (even assuming that they begin to be processed at the same point in time) is greater than $1/\mu$, since we must wait for the slowest of the three to complete (and this will exceed the average task time for each; see Equation 4.4 below). We are seeing the cost (in increased system time) due to the dependencies among the paths from the initial node to the terminating node.


4.2.1.3  The  Concurrency  Measure 

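The concurrency measure can be calculated from the average system time as

$$\sigma = \frac{S(P)}{S(1)} = \frac{\mu\, S(P)}{N},$$

since a single processor needs, on average, $S(1) = N/\mu$ to execute all $N$ tasks. Substituting Equation (4.1) into the concurrency expression, we get

$$\sigma = \frac{\mu}{N} \sum_{i=1}^{r} T_i\, p_i.$$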

We can find $\sigma$ by another method. We have transformed the process graph into a Markov state transition diagram. If, in addition, we have a branch going from the terminating state back up to the initial state, we then have a discrete-state, continuous-time ergodic Markov chain. The equilibrium state probabilities can be solved from the balance equations, which equate the rate into a state to the rate out of the same state. In addition, we need $\sum_i \pi_i = 1$, where $\pi_i$ is the equilibrium probability of state $i$. Even though there might be a large number of states in the Markov chain, due to the characteristics of process graphs the balance equations form a lower triangular matrix, and it is easy to express all the state equilibrium probabilities in terms of $\pi_1$. Then, by using the equation $\sum_i \pi_i = 1$, we find $\pi_1$, which, in turn, gives us all the equilibrium state probabilities. Figure 4.4 gives an example of the balance equations in matrix form for the process graph given in Figure 4.3b, assuming $\mu_i = \mu$ for all $i$.

Figure 4.4 Balance Equation


Let us denote the matrix multiplication in Figure 4.4 by

$$A\,\pi = B\,\pi',$$

where $A$ is a $(K-1)\times(K-1)$ square matrix ($K$ is the number of states), $\pi$ is a column vector of the $\pi_i$'s ($i = 1, 2, \ldots, K-1$), $B$ is a vector of the rates out of states $2, 3, \ldots, K$, applied entry by entry, and $\pi'$ is a column vector of the $\pi_i$'s ($i = 2, 3, \ldots, K$). A $\mu$ in the $i$-th row and $j$-th column of matrix $A$ represents an edge with label $\mu$ from state $j$ into state $(i+1)$ in Figure 4.3b. Therefore the product of the $i$-th row of $A$ with $\pi$ gives the rate going into state $(i+1)$. The $i$-th entry in $B$ represents the number of edges leaving state $(i+1)$ multiplied by $\mu$. Thus, multiplying the $i$-th entry in $B$ by the $i$-th entry in $\pi'$ gives the rate out of state $(i+1)$. Hence, $A\pi = B\pi'$ equates the rate in and the rate out of state $i$, for $2 \le i \le 16$. For example, the 6th row of $A\pi$ contains $\mu\pi_3 + \mu\pi_4$, which is the rate into state 7, and the 6th row of $B\pi'$ contains $3\mu\pi_7$, which is the rate out of state 7.

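Once we have found the $\pi_i$, we can proceed as follows to obtain the concurrency measure:

$$\bar{a} = \sum_k a_k\, \pi_k,$$

where $a_k$ is the number of active tasks in state $k$ and $\bar{a}$ is the average number of tasks being executed over the entire execution period. The $N$ tasks of a job, each with mean length $1/\mu$, are completed in an average time $S(P)$ with $\bar{a}$ tasks in execution on average, so $\bar{a}\, S(P) = N/\mu = S(1)$. Hence,

$$\sigma = \frac{1}{\bar{a}} = \frac{1}{\sum_k a_k\, \pi_k}. \qquad (4.2)$$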


This  is  the  main  result  of  this  section. 


From renewal theory, we also know that $S(P)$ is the mean recurrence time of the initial state. Thus,

$$S(P) = \frac{1}{\pi_1 \mu},$$

where $\pi_1$ is the equilibrium probability of either the initial or the terminating state. Hence,

$$\sigma = \frac{1}{\pi_1 N}.$$

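A simple example is given next. Figure 4.5 is a process graph with its Markov chain shown in Figure 4.6. Here state 1 holds the initial task, state 2 the two tasks that can run in parallel, states 3 and 4 whichever of these two remains after the other has finished, and state 5 the terminating task. The set of balance equations is

$$\mu\pi_5 = \mu\pi_1, \qquad \mu\pi_1 = 2\mu\pi_2, \qquad \mu\pi_2 = \mu\pi_3 = \mu\pi_4, \qquad \mu\pi_3 + \mu\pi_4 = \mu\pi_5.$$

The solution is

$$\pi_1 = \pi_5 = \frac{2}{7}, \qquad \pi_2 = \pi_3 = \pi_4 = \frac{1}{7}, \qquad \sigma = \frac{1}{\pi_1 N} = \frac{7}{8}.$$

Figure 4.6 Markov Chain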
The small speedup ($1/\sigma = 8/7$) is due to the serial nature of the process graph in this example. Only two of the four tasks may be processed in parallel, and only for the duration of $t = \min\{t_1, t_2\}$, where $t_i$, $i = 1, 2$, is the random processing time of one of the tasks that can be executed in parallel.
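The equilibrium route can be sketched in the same way. The Python sketch below has its own compact CPM conversion, uses equal rates $\mu$, and works with exact fractions. It exploits the triangular structure of the balance equations: it visits the states so that each follows all states that branch into it, expresses each probability in terms of $\pi_1$ by equating rate in and rate out, normalizes, and then applies Equation (4.2). The graphs are illustrative assumptions: a seven-task graph consistent with the description of Figure 4.3a (A precedes B and C, B precedes D and E, C precedes F, and D, E, F precede G) and a four-task diamond matching the description of Figure 4.5.

```python
from fractions import Fraction

def cpm(succ):
    """Compact Algorithm CPM: a state is the frozenset of its active tasks."""
    preds = {t: set() for t in succ}
    for t, ss in succ.items():
        for s in ss:
            preds[s].add(t)
    initial = frozenset(t for t in succ if not preds[t])
    completed, branches, unlabeled = {initial: frozenset()}, {}, [initial]
    while unlabeled:
        state = unlabeled.pop()
        branches[state] = []
        for phi in sorted(state):
            done = completed[state] | {phi}
            nxt = (state - {phi}) | {s for s in succ[phi] if preds[s] <= done}
            if not nxt:                      # terminating task: back to the start
                nxt = initial
            elif nxt not in completed:
                completed[nxt] = done
                unlabeled.append(nxt)
            branches[state].append((phi, nxt))
    return initial, branches

def equilibrium(succ, mu=1):
    """Return (pi_1, sigma) for equal task rates mu, by forward substitution."""
    initial, branches = cpm(succ)
    # Topological order of the chain without the return branch, so every
    # state comes after all states that branch into it (triangular system).
    indeg = {s: 0 for s in branches}
    for out in branches.values():
        for _, nxt in out:
            if nxt != initial:
                indeg[nxt] += 1
    order, ready = [], [initial]
    while ready:
        s = ready.pop()
        order.append(s)
        for _, nxt in branches[s]:
            if nxt != initial:
                indeg[nxt] -= 1
                if indeg[nxt] == 0:
                    ready.append(nxt)
    x = {initial: Fraction(1)}                # unnormalized: pi_1 taken as 1
    inflow = {s: Fraction(0) for s in branches}
    for s in order:
        if s != initial:
            x[s] = inflow[s] / (len(s) * mu)  # rate in = rate out = x_s |s| mu
        for _, nxt in branches[s]:
            if nxt != initial:
                inflow[nxt] += x[s] * mu      # every branch carries rate x_s mu
    total = sum(x.values())
    pi = {s: v / total for s, v in x.items()}
    a_bar = sum(len(s) * p for s, p in pi.items())   # mean number of active tasks
    return pi[initial], 1 / a_bar                      # Equation (4.2)

succ = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
        "D": ["G"], "E": ["G"], "F": ["G"], "G": []}
diamond = {"1": ["2", "3"], "2": ["4"], "3": ["4"], "4": []}
for graph in (succ, diamond):
    pi1, sigma = equilibrium(graph)
    print(pi1, sigma, 1 / (pi1 * len(graph)))   # 18/91 13/18 13/18, then 2/7 7/8 7/8
```

For both graphs the two expressions for $\sigma$, $1/\bar{a}$ and $1/(\pi_1 N)$, coincide; the diamond gives $\sigma = 7/8$, i.e., a speedup of $8/7$.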


4.2.2 Bounds on the Average System Time $(k, G, x^*, P=\infty)$


In Section 4.2.1, we found the exact concurrency measure (Equation 4.2) for a fixed process graph with random task times. Although the algorithm used in obtaining it is not difficult to carry out, it is cumbersome either to calculate $S(P)$ by going through all the paths in the Markovian state transition diagram or to solve for the equilibrium state probabilities from a set of balance equations derived from the Markov chain. If the exact concurrency measure is not required, we may use upper and lower bounds as substitute measurements for the concurrency; they are usually much more easily obtained than the exact solution.

In order to find an upper bound, we "synchronize" the execution at each level by forcing all the tasks in the next level to wait for the slowest task in the current level to complete before they all start executing. We call the time between the synchronization of two neighboring levels the "forced synchronization time" (FST). If we sum up the FST at each level of a process graph, an upper bound for $S(P)$ is obtained. For a lower bound, we just find the average time required to execute the tasks in the longest path from the initial node to the terminating node. Since every job must execute the tasks on this path one after another, its average system time cannot be smaller, and we have a lower bound. In the following sections, we describe the exact procedures for finding these bounds.
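Both bounds are easy to compute once the levels are known. Below is a minimal Python sketch for exponential task times with a common rate $\mu$. The lower bound is the mean time of a longest path, $r/\mu$. The upper bound adds, level by level, the mean time for the slowest of the $w_\ell$ tasks on level $\ell$, which is $H_{w_\ell}/\mu$ with $H_n = 1 + 1/2 + \cdots + 1/n$ (the result of Equation (4.4)). The example edges are assumed (A precedes B and C, B precedes D and E, C precedes F, and D, E, F precede G).

```python
from collections import deque
from fractions import Fraction

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, i) for i in range(1, n + 1))

def system_time_bounds(succ, mu=1):
    """(lower, upper) bounds on S(P) for exponential tasks with common rate mu."""
    indeg = {t: 0 for t in succ}
    for t in succ:
        for s in succ[t]:
            indeg[s] += 1
    level = {t: 1 for t in succ if indeg[t] == 0}
    queue = deque(level)
    while queue:                      # longest-path levels, as defined in Chapter 3
        t = queue.popleft()
        for s in succ[t]:
            level[s] = max(level.get(s, 0), level[t] + 1)
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    r = max(level.values())
    widths = [sum(1 for v in level.values() if v == l) for l in range(1, r + 1)]
    lower = Fraction(r) / mu                        # mean time of a longest path
    upper = sum(harmonic(w) for w in widths) / mu   # sum of forced sync. times
    return lower, upper

succ = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
        "D": ["G"], "E": ["G"], "F": ["G"], "G": []}
print(system_time_bounds(succ))   # (Fraction(4, 1), Fraction(16, 3))
```

With $\mu = 1$ this gives $4 \le S(P) \le 16/3$, which brackets the exact mean system time of this graph, $91/18 \approx 5.0556$.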


4.2.2.1 Blocking Time of Predecessor Tasks

In this section, we find the average time contributed by the "blocking nodes." The blocking nodes of a specific node $i$ in the process graph are the nodes that have precedence relationships into node $i$. Since a task cannot start execution until all of its predecessors have been completed, we would like to find the average time required for the completion of all its predecessors in the previous level (assuming they all started at the same time).

Each node, $i$, in $G$ has several precedences entering it and several precedences exiting it. During the processing of a task, out-degrees do not influence the completion time of this task, but in-degrees do. Suppose there are $n$ precedences entering this node; the task can't start execution until all $n$ tasks are done. In other words, assuming that all these $n$ tasks are begun at the same time, the effective execution time of these $n$ tasks with respect to node $i$ is equivalent to $\max(t_1, t_2, \ldots, t_n)$, where $t_j$, $j = 1, 2, \ldots, n$, is the random service time of task $j$. We find the probability distribution function of this max as follows:

$$F_n(t) = \text{Prob}[\text{completion time of } n \text{ tasks} \le t] = \text{Prob}[t_1 \le t]\,\text{Prob}[t_2 \le t] \cdots \text{Prob}[t_n \le t] = [F(t)]^n,$$


where $F(t)$ is the probability distribution function for the service time of one task. The factorization into a product is due to the independence of the service times of the tasks. From probability theory, we derive the expected time for finishing $n$ tasks from

$$E[\text{completion time of } n \text{ tasks}] = \int_0^\infty [1 - F_n(t)]\,dt = \int_0^\infty \left[1 - F(t)^n\right] dt. \qquad (4.3)$$


Since $F(t)^n \le F(t)$, $E[\text{completion time of } n \text{ tasks}] \ge E[\text{completion time of one task}]$. If the service time is exponentially distributed with an average service time of $1/\mu$, then Equation (4.3) becomes

$$E[\text{completion time of } n \text{ tasks with exponentially distributed service times}] = \frac{1}{\mu}\sum_{i=1}^{n}\frac{1}{i}. \qquad (4.4)$$

Note that for $n \gg 1$, Equation (4.4) is approximately $(\ln n + \gamma)/\mu$, where $\gamma$ is Euler's constant, $\gamma = 0.57721\ldots$. Equation (4.4) can also be obtained from the following equivalent queueing system:

- A service center with $n$ servers, each server having an exponential service time distribution with mean $1/\mu$;
- no waiting room is allowed in the system, and the servers start execution when there are $n$ customers in the system;
- once the servers start execution, no new arrivals are allowed to replace the departed customers.

The time, $S$, to complete all $n$ customers measured from the start of execution satisfies

$$E[S] = \sum_{i=1}^{n} \frac{1}{i\mu},$$

since, while $i$ customers remain, the time to the next departure is exponentially distributed with rate $i\mu$.


The equivalence can be shown as follows:

- The $n$ customers in the queueing system are equivalent to the $n$ blocking nodes;
- one server becomes inactive after each customer departs from the queueing system, and this is the same as the completion of one blocking task;
- the service time of each customer is exponentially distributed with a mean service time of $1/\mu$, and the time required to complete a blocking task is also exponentially distributed with the same mean service time of $1/\mu$.

Hence, the average time required to empty this $n$-server queueing system is the same as the average time needed to complete the $n$ blocking tasks if all of them start execution at the same time.
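Equations (4.3) and (4.4) can be cross-checked numerically. The Python sketch below integrates $1 - F(t)^n$ for $F(t) = 1 - e^{-\mu t}$ with the trapezoidal rule (the truncation point and step count are arbitrary choices for the example). It compares the result with the harmonic sum of Equation (4.4) and with the large-$n$ approximation $(\ln n + \gamma)/\mu$, and it reports the speedup $n/H_n$ obtained when $n$ tasks are started together instead of run one after another.

```python
import math

EULER_GAMMA = 0.5772156649015329

def expected_max_integral(n, mu=1.0, horizon=50.0, steps=100_000):
    """Equation (4.3) for F(t) = 1 - exp(-mu t), trapezoidal rule on [0, horizon/mu]."""
    upper = horizon / mu
    h = upper / steps

    def tail(t):                      # 1 - F(t)^n
        return 1.0 - (1.0 - math.exp(-mu * t)) ** n

    total = 0.5 * (tail(0.0) + tail(upper))
    total += sum(tail(k * h) for k in range(1, steps))
    return total * h

def expected_max_closed(n, mu=1.0):
    """Equation (4.4): (1/mu)(1 + 1/2 + ... + 1/n)."""
    return sum(1.0 / i for i in range(1, n + 1)) / mu

for n in (2, 10, 100):
    exact = expected_max_closed(n)
    print(n,
          round(expected_max_integral(n), 4),   # numerical Eq. (4.3)
          round(exact, 4),                      # Eq. (4.4)
          round(math.log(n) + EULER_GAMMA, 4),  # large-n approximation
          round(n / exact, 2))                  # speedup n / H_n
```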


From Equation (4.4), we note that without the blocking effect, each task of a process graph requires an average service time of $1/\mu$. However, with the blocking effect, the average time to complete a task becomes approximately $(\ln n)/\mu$, where $n$ is the number of tasks blocking this task. Thus $\sigma \approx (\ln n)/n$, giving a speedup of $n/\ln n$; this falls short of the maximum possible speedup, which is equal to $n$. The reason for this poor speedup is clearly the blocking effect.


Since we know the in-degrees of each node in the process graph, the average time required to wait for the completion of all the precedence tasks for a specific task, $i$, can be calculated from Equation (4.4).


4.2.2.2  Bounds  for  Structured  Process  Graphs 


We will first study "structured process graphs," which are defined as having the following properties:

1. All sons of a node, $i$, must merge back into one single node, $j$, before reaching the terminating node, and

2. only node $i$ can have a direct precedence relationship into each son of node $i$; no other nodes may have direct precedence relationships into sons of node $i$.

With the above properties, each son may be replaced with a set of tasks with the same properties. A structured program is a good analogy for this process graph. In a program, which must be entered at one specific point and exited at another, several parallel blocks of code may be executed simultaneously, but each block must be entered and exited at specific points. Property 2 above states that no 'GOTO' statements may direct the execution out of or into a block of code. Figure 4.7a) shows a structured process graph, while the edge e in Figure 4.7b) violates the properties of a structured process graph.

Within each structured process graph, we can divide the tasks (other than the starting and terminating nodes) into several mutually exclusive sets. Each set of nodes, together with the starting and terminating nodes, forms a structured sub-process graph. Property 2 prevents the precedences from a node in one set of tasks leading into a node in another set. We call each set of tasks a 'group-path,' and we let m denote the number of group-paths in a structured process graph. Algorithm GP below describes a method for finding all group-paths for a given structured process graph.

This algorithm begins at the starting node of G and looks for nodes that are in a single group-path by keeping track of the nodes diverging out of each task. If tasks eventually merge back after diverging out of a single node, they are considered as one group-path. If tasks do not merge back before the terminating node, they will be considered as separate group-paths.

By  substituting  other  tasks  for  the  starting  and  terminating  nodes  in  Algorithm  GP,  we 
can  find  the  sub-group-paths  within  the  process  graph. 


Nodes within the same group-path have the same PATHNUMBER in the algorithm below:

Figure 4.7  a) A structured process graph  b) A non-structured process graph

Algorithm GP:

STEP 0   Mark all nodes 'NEW.'
         PATHNUMBER ← 1
         PATH(v) ← 0 for all nodes v
         STACK ← 'empty'
         STORAGE ← 'empty'

STEP 1   v ← the starting node of G.
         Mark v 'OLD.'
         Store the sons of v on the STACK.

STEP 2   If STACK is empty, STOP.
         Otherwise, remove the task at the top of STACK and call this task v.

STEP 3   If v is 'NEW,'
             put v in STORAGE,
             mark it 'OLD,'
             put all sons of v on STACK,
             GO TO STEP 5.

STEP 4   If v is 'OLD,'
             PATH(w) ← PATH(v) for all nodes w in STORAGE,
             STORAGE ← 'empty',
             GO TO STEP 2.

STEP 5   If v is the terminating node,
             mark it 'NEW' again (since the terminating node is not considered
             to be in any one particular group-path),
             PATH(w) ← PATHNUMBER for all nodes w in STORAGE,
             STORAGE ← 'empty',
             PATHNUMBER ← PATHNUMBER + 1.
         GO TO STEP 2.
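The following is a minimal Python sketch of Algorithm GP. It assumes the graph is given as a hypothetical adjacency dictionary that maps each node to the list of its sons. It follows the steps above with one simplification: the terminating node is never placed in STORAGE, so it receives no group-path label. This is consistent with its not belonging to any particular group-path.

```python
def group_paths(sons, start, terminal):
    """Label every intermediate node with its group-path number (Algorithm GP sketch).

    sons: dict mapping each node to the list of its sons (a DAG with one
    starting node and one terminating node).
    """
    nodes = set(sons) | {w for ws in sons.values() for w in ws}
    is_new = {v: True for v in nodes}   # STEP 0: mark all nodes 'NEW'
    path = {v: 0 for v in nodes}        # PATH(v) <- 0
    pathnumber = 1
    stack = []
    storage = []

    is_new[start] = False               # STEP 1: begin at the starting node
    stack.extend(sons.get(start, []))

    while stack:                        # STEP 2: stop when STACK is empty
        v = stack.pop()
        if v == terminal:               # chain reached the end: it is a new group-path
            for w in storage:
                path[w] = pathnumber
            storage.clear()
            pathnumber += 1
        elif is_new[v]:                 # unvisited node: extend the current chain
            storage.append(v)
            is_new[v] = False
            stack.extend(sons.get(v, []))
        else:                           # already labelled node: chain joins its group-path
            for w in storage:
                path[w] = path[v]
            storage.clear()

    return {v: p for v, p in path.items() if v not in (start, terminal)}
```

For a graph in which the sons of one node rejoin at a common node before the terminating node, all of those nodes receive the same label. Chains that reach the terminating node separately receive distinct labels.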


For m = 1, we know that there does not exist any individual path that is not coupled with some other part of G. Instead, many precedence relationships exist between two neighboring levels. Because of this, the line of active tasks in G roughly follows the levels of G, since blocking prevents any task from becoming active if it is several levels ahead of the active tasks.

From this analysis, we can immediately find an upper bound on the average system time of G. Suppose we force the execution to complete one level of G at a time. Then no task in the following level is allowed to start, even if there are no tasks on the current level to block it. The average service time for each level is the average time required to process the slowest task, i.e., the task with the largest number of blocking nodes from the previous level. We introduce a new parameter, $d_{\max,i}$, which gives the largest in-degree per task over all tasks on level i of G. From our definition of the process graph, we have $d_{\max,1} = 0$ and $d_{\max,2} = 1$ for all G.


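Theorem 4.1

Given a process graph, G, which has one group-path, r levels and $\{d_{\max,i} \mid i = 1, \dots, r\}$, $S_{UB}$, an upper bound for the average system time of G, is equal to the sum of the average times required to process the node with the largest in-degree at each level:

$$S_{UB} = \sum_{i=1}^{r} \left[\begin{array}{c}\text{average time required to process a}\\ \text{node with in-degree of } d_{\max,i}\end{array}\right] = \sum_{i=1}^{r} \frac{1}{\mu} \sum_{j=1}^{d_{\max,i}} \frac{1}{j}$$

where 1/μ is the average service time of one task. If $d_{\max,i} = 0$, the average processing time is defined to be the average service time of one task. (The sum on j can be approximated by $\ln d_{\max,i}$ when $d_{\max,i} \gg 1$.)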

Proof:

$d_{\max,i}$ is the largest in-degree for level i of G. Sum the average times required to process tasks $t_i$, for i = 1, 2, ..., r, where the in-degree of $t_i$ is $d_{\max,i}$. The resulting average time equals that of a process graph with a path, p, from the starting node to the terminating node, in which each node on this path at the i-th level has an in-degree equal to $d_{\max,i}$. If any one of the nodes with in-degree $d_{\max,i}$ is not on the same path, we must show that the resulting average system time is no greater than $S_{UB}$.

Suppose the maximum in-degree node on level j of G is on path p' rather than path p (see Figure 4.8). Let $t_1$ denote the node in the j-th level for path p, $t_2$ the node in the (j+1)-th level for path p, and $t_3$ the node in the j-th level for path p'. The number of blocking nodes for node $t_1$ (that is, $d_{t_1}$) is less than or equal to the number of blocking nodes for node $t_3$ (that is, $d_{\max,j}$). Therefore, the average time required to complete node $t_1$ is smaller than or equal to the average time required to complete node $t_3$:

$$\frac{1}{\mu}\sum_{l=1}^{d_{t_1}} \frac{1}{l} \le \frac{1}{\mu}\sum_{l=1}^{d_{t_3}} \frac{1}{l} \qquad \text{for } d_{t_1} \le d_{t_3} = d_{\max,j}.$$

Figure 4.8  Maximum In-degree Nodes (paths p and p'; j-th and (j+1)-th levels)

Therefore, the average time to complete path p with node $t_1$ in the j-th level instead of node $t_3$ is smaller than or equal to $S_{UB}$. Similarly, suppose the nodes $t_{j_1}, t_{j_2}, \dots, t_{j_a}$ on levels $j_1, j_2, \dots, j_a$ of path p are not the maximum in-degree nodes. Then

$$\sum_{s=1}^{a} \frac{1}{\mu}\sum_{l=1}^{d_{t_{j_s}}} \frac{1}{l} \le \sum_{s=1}^{a} \frac{1}{\mu}\sum_{l=1}^{d_{\max,j_s}} \frac{1}{l}$$

because $d_{t_{j_s}} \le d_{\max,j_s}$ for s = 1, 2, ..., a.

Hence, $S_{UB}$ is an upper bound for the average system time of G.


A simple lower bound on the average system time is just the average time required by a task multiplied by the number of levels.

Theorem 4.2

A lower bound on the average system time of a process graph G with r levels is the average service time of one task multiplied by r.

Proof:

Since each level is defined as having at least one task in it, the minimum time required to process one level is the service time of one task. For r levels, the minimum average system time cannot be lower than the average service time of one task multiplied by r.

The upper and lower bounds obtained in the above two theorems are for process graphs with specific sets of $\{d_{\max,i} \mid \forall i\}$. However, each set $\{d_{\max,i} \mid \forall i\}$ can represent a number of different process graphs. Indeed, if we do not limit the number of tasks per process graph, there could be an infinite number of process graphs generated from each $\{d_{\max,i} \mid \forall i\}$. The proofs of both theorems did not use the number of tasks or the structure of the task graphs. In other words, given the same set $\{d_{\max,i} \mid \forall i\}$ in both theorems, we have a class of process graphs with the same upper and lower bounds.
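Theorems 4.1 and 4.2 both reduce to simple sums over the levels. The sketch below evaluates them from a list of per-level maximum in-degrees $d_{\max,i}$, with the average service time taken as 1/μ. The function names are hypothetical.

```python
def level_time(d: int, mu: float = 1.0) -> float:
    """Average time to process a node with in-degree d (one term of Theorem 4.1).

    d = 0 (the starting node) is defined as one average service time, 1/mu.
    """
    if d == 0:
        return 1.0 / mu
    return sum(1.0 / j for j in range(1, d + 1)) / mu


def system_time_bounds(d_max: list[int], mu: float = 1.0) -> tuple[float, float]:
    """Return (S_LB, S_UB) for a process graph with one group-path.

    d_max[i] is the largest in-degree on level i + 1, so len(d_max) == r.
    """
    s_lb = len(d_max) / mu                        # Theorem 4.2: r average service times
    s_ub = sum(level_time(d, mu) for d in d_max)  # Theorem 4.1: sum over the levels
    return s_lb, s_ub
```

Only the multiset of in-degrees matters. Any process graph sharing the same list gets the same pair of bounds, which is the point made in the paragraph above.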

For a structured process graph with m = 1 group-path, the actual progress of active tasks sometimes closely follows the physical levels of the process graph. This is caused by the precedence relationships between adjacent levels, which force synchronization at each level. Hence, the average system time is often close to the upper bound. This fact has also been verified by simulation. For m > 1 group-paths, there are no precedence relationships between the group-paths. The line of active tasks therefore progresses at a different rate for each group-path, depending on the random service time requirement of each task.

If a process graph has more than one group-path, the method described above for obtaining bounds on the average system time must be improved to show the influence of the number of group-paths. To classify a general structured process graph, we require the maximum in-degree for each level i of each group-path j: $\{d_{\max,ij} \mid \forall i, j\}$. The order of the maximum in-degrees per level in $\{d_{\max,ij} \mid \forall i\}$ for each j does not influence the bounds, since any permutation will produce a process graph with a similar bound. By the same argument, the order of the sets of maximum in-degrees for the group-paths does not influence the bounds.

As has been discussed for the case of m = 1, the average system time is usually close to the upper bound. We will therefore use the forced synchronization per level to approximate the average processing time required for each group-path. Consider a group-path j and level i whose maximum in-degree node has in-degree $d_{\max,ij}$. Let the probability distribution function for the service time of this level be $F_{ij}(t)$. The probability density function is then

$$f_{ij}(t) = \frac{dF_{ij}(t)}{dt} \qquad (4.5)$$

To obtain the approximate average system time for the multi-path process graph, we can reduce each group-path to one 'super-node.' The probability density function of the service time for each super-node j is

$$f_j(t) = f_{1j}(t) \otimes f_{2j}(t) \otimes \cdots \otimes f_{r_j j}(t) \qquad (4.6)$$

where ⊗ represents the convolution operator, $r_j$ is the number of levels in group-path j, and each $f_{ij}(t)$ is a probability density function given by Equation (4.5). The corresponding probability distribution function is

$$F_j(t) = \int_0^t f_j(x)\,dx \qquad (4.7)$$


Looking at the terminating node, it has m 'super-nodes' "blocking" it. This is the exact analogy of Equation (4.3), with $F_i(t)$ replaced by the $F_j(t)$ of Equation (4.7):

$$\left[\begin{array}{c}\text{average time required to process a}\\ \text{node with in-degree of } m \text{ group-paths}\end{array}\right] = \int_0^\infty \left[1 - \prod_{j=1}^{m} F_j(t)\right] dt \qquad (4.8)$$


The convolution in Equation (4.6) is a tedious task. However, by the central limit theorem [PAPO65], the sum of a large number of independent random variables has a probability density function close to a normal density. The mean of the sum equals the sum of each level's mean processing time, and the variance of the sum equals the sum of the variances of each level. We assume that the service times of tasks on different levels are independent of each other. For each group-path j, we approximate the probability density function of its processing time by a normal density function with mean $a_j$ and variance $b_j$. Here $a_j$ is the sum of the average times needed at each level of the group-path under forced synchronization, and $b_j$ is the sum of the corresponding variances of the processing times at each level.


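Using this approximation, Equations (4.6) and (4.7) become

$$f_j(t) \approx \frac{1}{\sqrt{2\pi b_j}}\, e^{-(t-a_j)^2/(2b_j)} \qquad (4.9)$$

and

$$F_j(t) \approx \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi b_j}}\, e^{-(x-a_j)^2/(2b_j)}\,dx \qquad (4.10)$$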


respectively. Substituting $F_j(t)$ into Equation (4.8), we may then calculate the upper bound, $S_{UB}$, on the average system time required to complete a process graph with m group-paths as

$$S_{UB} \approx \int_0^\infty \left[1 - \prod_{j=1}^{m} \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi b_j}}\, e^{-(x-a_j)^2/(2b_j)}\,dx\right] dt$$
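This approximation can be evaluated numerically. The sketch below makes an assumption that is not stated in the text: a level whose widest node has in-degree d is taken to last the maximum of d i.i.d. exponential(μ) times. Its mean is then $H_d/\mu$ and its variance is $\sum_{i=1}^{d} 1/(i^2\mu^2)$. The sketch sums these into $a_j$ and $b_j$, then integrates Equation (4.8) with the normal $F_j(t)$ of Equation (4.10) by the midpoint rule. All names are hypothetical.

```python
import math


def group_path_moments(d_max, mu=1.0):
    """Mean a_j and variance b_j of one group-path under forced synchronization.

    Assumption: each level lasts the max of d i.i.d. exponential(mu) times;
    d = 0 is treated as a single service time, like d = 1.
    """
    a = b = 0.0
    for d in d_max:
        d = max(d, 1)
        a += sum(1.0 / i for i in range(1, d + 1)) / mu
        b += sum(1.0 / i**2 for i in range(1, d + 1)) / mu**2
    return a, b


def normal_cdf(x, mean, var):
    """Normal distribution function, the F_j(t) of Equation (4.10)."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def approx_system_time(group_paths, mu=1.0, dt=1e-3):
    """Integrate 1 - prod_j F_j(t) over t >= 0 (Equation (4.8)) with normal F_j."""
    moments = [group_path_moments(p, mu) for p in group_paths]
    t_end = max(a + 10.0 * math.sqrt(b) for a, b in moments)  # tail is negligible beyond this
    total = 0.0
    for step in range(int(t_end / dt)):
        t = (step + 0.5) * dt  # midpoint rule
        prod = 1.0
        for a, b in moments:
            prod *= normal_cdf(t, a, b)
        total += (1.0 - prod) * dt
    return total
```

With a single group-path, the integral essentially returns the mean $a_1$. With two identical group-paths, it returns about $a + \sqrt{b/\pi}$, the mean of the larger of two independent normal variables. This shows how additional group-paths lengthen the average system time.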


4.2.2.3  Bounds for Non-structured Process Graphs

For  a  non-structured  process  graph,  it  is  harder  to  obtain  an  improved  expression  for 
the  average  system  time.  Although  the  lower  bound  expression  on  the  average  system  time  is 
still  the  same  as  that  of  the  structured  process  graph,  we  have  been  unable  to  find  an  upper 
bound;  this  is  due  to  the  complicated  coupling  of  the  tasks,  which  makes  it  almost  impossible  to 
find  group  paths. 


4.2.2.4  Tightness of the Bounds (m = 1)

The upper bound is obtained by summing, over all levels of the process graph, the average time required to process the node with the largest in-degree at each level. For a given number of tasks, N, and number of levels, r, we know the lower bound to be $S_{LB} = r/\mu$. By obtaining the upper bound for the worst arrangement with N and r, we know approximately how tight the bounds are. We place one node each in the initial and terminating levels and distribute the remaining N - 2 nodes among the other r - 2 levels. We also assume that precedence relationships exist between all nodes of adjacent levels, so that any two adjacent levels form a complete bipartite graph. We are then looking for

$$\max_{n_2, \dots, n_{r-1}} \; \sum_{j=2}^{r-1} \sum_{i=1}^{n_j} \frac{1}{i} \qquad \text{subject to} \qquad \sum_{j=2}^{r-1} n_j = N-2,$$

where $n_j$ is the number of tasks on level j. If $(N-2)/(r-2)$ is an integer, this maximum occurs when $n_j = (N-2)/(r-2)$ for all j = 2, 3, ..., r-1. Otherwise, it occurs when

$$n_j = \left\lceil \frac{N-2}{r-2} \right\rceil \text{ for } j = 2, 3, \dots, x+1, \qquad n_j = \left\lfloor \frac{N-2}{r-2} \right\rfloor \text{ for } j = x+2, \dots, r-1,$$

where x is the remainder of $(N-2)/(r-2)$. The ratio of the maximum upper bound, $S_{UB}$, to the lower bound, $S_{LB}$, gives us a measure of the tightness of the bounds.


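For example, for N = 6 and r = 4 (the leading 2 accounts for the first two levels, with $d_{\max,1} = 0$ and $d_{\max,2} = 1$):

$$S_{UB} = \frac{1}{\mu}\left[2 + (r-2)\sum_{j=1}^{(N-2)/(r-2)} \frac{1}{j}\right] = \frac{1}{\mu}\left[2 + 2\left(1 + \tfrac{1}{2}\right)\right] = \frac{5}{\mu}, \qquad S_{LB} = \frac{4}{\mu}.$$

Thus, $S_{UB}/S_{LB} = 1.25$, which indicates that the upper and lower bounds are very close to each other. For a larger process graph, such as N = 102 and r = 22,

$$S_{UB} = \frac{1}{\mu}\left[2 + 20\sum_{j=1}^{5} \frac{1}{j}\right] \approx \frac{47.67}{\mu}, \qquad S_{LB} = \frac{22}{\mu}, \qquad \frac{S_{UB}}{S_{LB}} \approx 2.17,$$

and the bounds are further apart.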


Of  course,  we  are  comparing  the  maximum  upper  bound  possible,  but  as  N  becomes 
large,  the  ratio  of  the  upper  to  lower  bounds  also  gets  larger.  The  exact  upper  bound  depends 
on  the  number  of  levels  and  the  number  of  precedence  relationships  in  a  given  process  graph. 
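The worst-case ratio can be checked directly. The sketch below uses hypothetical function names, and μ cancels from the ratio. It builds the balanced level sizes described above. For small N, it compares them against an exhaustive search over every way of splitting the N - 2 middle tasks among the r - 2 middle levels.

```python
from itertools import combinations


def harmonic(n: int) -> float:
    return sum(1.0 / i for i in range(1, n + 1))


def upper_bound_from_sizes(sizes) -> float:
    """mu * S_UB: 2 for the first two levels plus H(n_j) for each middle level."""
    return 2.0 + sum(harmonic(n) for n in sizes)


def worst_case_ratio(N: int, r: int) -> float:
    """S_UB / S_LB for the balanced (worst) arrangement, 3 <= r <= N; S_LB = r / mu."""
    q, x = divmod(N - 2, r - 2)
    sizes = [q + 1] * x + [q] * (r - 2 - x)  # x levels get one extra task
    return upper_bound_from_sizes(sizes) / r


def brute_force_ratio(N: int, r: int) -> float:
    """Same ratio, maximized over every split of N - 2 tasks into r - 2 nonempty levels."""
    total, parts = N - 2, r - 2
    best = 0.0
    for cuts in combinations(range(1, total), parts - 1):
        edges = (0,) + cuts + (total,)
        sizes = [edges[i + 1] - edges[i] for i in range(parts)]
        best = max(best, upper_bound_from_sizes(sizes))
    return best / r
```

`worst_case_ratio(6, 4)` gives 1.25 and `worst_case_ratio(102, 22)` gives about 2.17, matching the two examples. Balancing is optimal because each extra task on a level adds a smaller term 1/n.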
Where, between the two bounds, does the exact average system time lie? This depends on how tightly the tasks are coupled to each other in the process graph. The more tightly they are coupled, the closer the average system time is to the upper bound, and vice versa. Figure 4.9 shows a process graph with an average system time close to the upper bound.


4.3  Arrivals of Jobs (λ, G, x*, P = ∞)

In the previous section (Section 4.2), we obtained the average system time of a job and bounds on it. Since we have assumed that there is an infinite number of processors, all jobs start execution at time zero. In this section, we assume that the jobs arrive from a Poisson source. As soon as a job arrives at the system, it starts execution (again, this is due to P = ∞). Thus, the results obtained in Section 4.2 can also be applied to this case. In addition, from Little's result, there are on average $\bar{k} = \lambda \bar{S}(P)$ jobs in the system, where $\bar{S}(P)$ is the average system time obtained in Section 4.2.1.2. From the bounds on the average system time we then have

$$\lambda S_{LB} \le \bar{k} \le \lambda S_{UB}$$

where $S_{UB}$ and $S_{LB}$ are the bounds obtained in Section 4.2.2.


4.4  Stochastic Petri Nets (k, G, x*, P < ∞)

In this section, we limit the number of processors to P < ∞. The Stochastic Petri Net (SPN) [MOLL81] model is used to find the average utilization of these P processors, given k jobs, a fixed process graph and exponentially distributed task service times. A process graph can easily be transformed into a Petri Net, as was shown in Section 2.3.1. To the resulting Petri Net we add a "place" called "Processor Available," with P tokens in it, and another "place" called "Unexecuted Jobs," with k tokens in it. Initially, all other places have no tokens in them. We add an edge from the "Processor Available" place to each transition requiring a processor, and another edge from each transition that has finished using the processor back to the "Processor Available" place. Figure 4.10 gives an example of how we transform a process graph into such a Petri Net.

Figure 4.10b  Petri Net


When all tokens in the "Unexecuted Jobs" place have been used up, and no tokens remain in any place except the "Processor Available" place, the Petri Net is said to have reached the recurrent state. From the analysis provided by the Stochastic Petri Net, we can find the average number of tokens, $\bar{I}$, in the "Processor Available" place. This is also the average number of idle processors. Hence, the average utilization of the P processors is

$$U = 1 - \frac{\bar{I}}{P}.$$


Since P is limited, we do have a scheduling problem; however, an SPN does not allow the assignment of a specific task to a processor. The assignment depends on which transition requiring a processor fires next. Thus, the performance obtained with an SPN analysis lies between the best and worst assignment results.

If, instead, the value of P is the design parameter, we may use the definition of power [KLEI79] to find the optimal number of processors for a specific process graph and number of jobs. Power is defined to be the utilization of the processors divided by the normalized average system time. An SPN provides the values of both of these variables for a specific value of P. We can therefore plot power versus the number of processors to find the number P at which the power is maximized.


4.5  Task Assignment (k, G, x, P < ∞)

In  this  section,  we  find  bounds  on  the  average  system  time  by  developing  algorithms 
that  will  give  the  best  and  worst  scheduling  in  terms  of  the  average  system  time. 

If the ratio of these two bounds is not large, random scheduling of the tasks to the processors could perhaps be allowed. Random assignment has the advantage that no overhead is needed to schedule tasks: whenever a processor is available, it simply grabs any task that is ready to be executed. We know the performance of the system must fall between the two bounds.


First, we assume the shape of the process graph to be bounded by a diamond, as in Figure 4.11.


Figure  4.11  Diamond-shaped  Process  Graph 


This type of process graph can be characterized by two parameters, L and m. L is the number of levels in the process graph, and m is the slope of the diamond enveloping the boundary tasks. We assume a continuum of tasks within the diamond.

Since the service times of tasks are constant, we normalize the service time of each task to one unit of time.

From [COFF76], we know that the assignment which minimizes the average system time is the shortest expected remaining processing time first assignment. This is the Depth-first Assignment Algorithm, in which all available processors are assigned to the tasks in the job that is closest to being completed. In other words, we are trying to complete jobs as fast as possible.

On the other hand, if we want to maximize the utilization of the processors, then the longest expected remaining time first assignment is used. This is the Breadth-first Assignment Algorithm, in which all available processors are assigned to the jobs that have the most processing remaining. In this assignment, we are trying to process all jobs at the same time, so that all the jobs complete at times very close to each other. We are interested in finding the ratio of the average system times obtained from these two assignments,

$$\psi = \frac{S_b}{S_d},$$

where $S_b$ is the average system time using Breadth-first Assignment and $S_d$ is the average system time using Depth-first Assignment.


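Assume that all jobs depart at the same instant as the last job when calculating $S_b$. The n levels at each end of the diamond hold at most P tasks across all k jobs, so each of them takes one unit of time, while the remaining tasks keep all P processors busy. Then

$$S_b = 2n + \frac{k}{P}\left(\frac{L^2}{2m} - \frac{2n^2}{m}\right),$$

where $n = Pm/(2k)$ and $P \le kL/m$. Simplifying the above expression, we get

$$S_b = \frac{Pm}{2k} + \frac{kL^2}{2Pm}.$$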


If we provide the maximum number of processors, $P = kL/m$, then

$$S_b = L;$$

that is, it takes L units of service time to complete all k jobs. This is what we expect, since each job has L/m processors, which is equivalent to P = ∞. Thus, each job takes L units of time to complete, and all k jobs run in parallel.


If we let P → 1, then

$$S_b \to \frac{m}{2k} + \frac{kL^2}{2m}.$$

Since there are $L^2/(2m)$ tasks in each job, it takes $k\left(\frac{L^2}{2m} - 1\right) + i$ units of time to complete the i-th job. Thus, the average system time for these k jobs is

$$S = \frac{1}{k}\sum_{i=1}^{k}\left[k\left(\frac{L^2}{2m} - 1\right) + i\right] = k\left(\frac{L^2}{2m} - 1\right) + \frac{k+1}{2}.$$

The difference between S and $S_b$ is due to the assumption, made in calculating $S_b$, that all jobs depart at the same instant as the last job. As we see, this assumption is pessimistic, so it may be made in obtaining our bound. Thus, for k ≥ 2, $S_b > S$.


For example, if we let P = 1, m = 1, k = 10, and L = 5, then

$$S_b = \frac{m}{2k} + \frac{kL^2}{2m} = \frac{1}{20} + 125 = 125.05$$

and

$$S = k\left(\frac{L^2}{2m} - 1\right) + \frac{k+1}{2} = 115 + 5.5 = 120.5.$$

As predicted, $S < S_b$. By changing the value of P to 50, while keeping all other parameters in the above example the same, we get

$$S_b = \frac{50}{(2)(10)} + \frac{(10)(25)}{(2)(50)} = 2.5 + 2.5 = 5,$$

which is exactly L, the average system time expected with the Breadth-first Assignment when P = kL/m = 50.
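The closed forms above are easy to check numerically. The sketch reproduces the worked example. Like the derivation, it assumes a continuum of tasks, so $L^2/(2m)$ need not be an integer. For the single-processor case, it reads the argument above as the k jobs being served one task at a time in round-robin order. The function names are hypothetical.

```python
def s_breadth(P, m, k, L):
    """Breadth-first average system time S_b = Pm/(2k) + kL^2/(2Pm)."""
    if not 1 <= P <= k * L / m:
        raise ValueError("formula assumes 1 <= P <= kL/m")
    return P * m / (2 * k) + k * L**2 / (2 * P * m)


def s_single_processor_exact(m, k, L):
    """Average system time with P = 1, jobs served task by task in round-robin order.

    Job i finishes at k * (tasks_per_job - 1) + i.
    """
    tasks_per_job = L**2 / (2 * m)
    completion = [k * (tasks_per_job - 1) + i for i in range(1, k + 1)]
    return sum(completion) / k
```

`s_breadth(1, 1, 10, 5)` gives 125.05, `s_single_processor_exact(1, 10, 5)` gives 120.5, and `s_breadth(50, 1, 10, 5)` gives 5.0, the three values in the example.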


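As for $S_d$, the least average system time can be obtained by treating the process graph as a rectangle, since for this shape of process graph the utilization of the processors is at its maximum. The total number of tasks in a process graph is $L^2/(2m)$, so the width (average number of tasks per level) of the rectangular process graph is $L/(2m)$. Thus, the largest number of jobs that can be processed at the same time by P processors is

$$k^* = \min\left(\left\lfloor \frac{P}{L/2m} \right\rfloor, k\right).$$

Thus,

$$S_d = \frac{1}{k}\left[\sum_{i=1}^{\lfloor y \rfloor} i\,k^* L + \left(k - \lfloor y \rfloor k^*\right)\left(\lfloor y \rfloor + 1\right) L\right], \qquad \text{where } y = \frac{k}{k^*}.$$

In the numerator, the summation term reflects the fact that $k^*$ jobs are completed every L units of time. The second term represents the time required, $(\lfloor y \rfloor + 1)L$, by the last $(y - \lfloor y \rfloor)k^*$ jobs. Hence,

$$S_d = \frac{L}{y}\left[\frac{\lfloor y \rfloor(\lfloor y \rfloor + 1)}{2} + (y - \lfloor y \rfloor)(\lfloor y \rfloor + 1)\right].$$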


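If we assume that $P/(L/2m)$ is an integer and $P/(L/2m) < k$, then

$$y = \frac{k}{P/(L/2m)} = \frac{kL}{2Pm}, \qquad \psi = \frac{S_b}{S_d} = \frac{\dfrac{kL^2}{2Pm} + \dfrac{Pm}{2k}}{S_d}.$$

If y is an integer, then $S_d = (y+1)L/2$ and

$$\psi = \frac{\dfrac{kL}{2Pm} + \dfrac{Pm}{2kL}}{\dfrac{kL}{4Pm} + \dfrac{1}{2}}.$$

Otherwise,

$$\psi = \frac{\dfrac{kL^2}{2Pm} + \dfrac{Pm}{2k}}{\dfrac{L}{y}\left[\dfrac{\lfloor y \rfloor(\lfloor y \rfloor + 1)}{2} + (y - \lfloor y \rfloor)(\lfloor y \rfloor + 1)\right]}.$$

If we assume $P/(L/2m) \ge k$, then y = 1 and $S_d = L$, so that

$$\psi = \frac{1}{L}\left(\frac{kL^2}{2Pm} + \frac{Pm}{2k}\right) = \frac{kL}{2Pm} + \frac{Pm}{2kL}.$$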


For example, Figure 4.12 shows ψ versus P for L = 10, k = 5 and m = 1, and Figure 4.13 shows ψ versus P for L = 10, k = 5 and m = 2. We observe, in both Figures 4.12 and 4.13, that the value of ψ falls initially, then rises slightly before decreasing monotonically again. This rise is caused by the assumption of a rectangular process graph in calculating $S_d$. The rectangular process graphs have a constant width of L/2m. When the value of P reaches a multiple of L/2m, an additional job can depart at every L time step. Whenever P is close to a multiple of L/2m, this decreases $S_d$ faster than $S_b$ decreases. Once $P \ge kL/(2m)$, this effect disappears.
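Under the same continuum and rectangular-graph assumptions, the sketch below evaluates $S_d$ and ψ for the parameters of Figures 4.12 and 4.13 (L = 10, k = 5, m = 1 or 2). The sketch makes two assumptions of its own: it caps the number of side-by-side jobs at k, and it requires P ≥ L/(2m) so that at least one job fits. The function names are hypothetical.

```python
import math


def s_breadth(P, m, k, L):
    """Breadth-first average system time S_b = Pm/(2k) + kL^2/(2Pm), for P <= kL/m."""
    return P * m / (2 * k) + k * L**2 / (2 * P * m)


def s_depth(P, m, k, L):
    """Depth-first average system time for the rectangular approximation.

    The rectangle has width L/(2m); k_star jobs run side by side, and a batch
    of k_star jobs completes every L time units.
    """
    width = L / (2 * m)
    k_star = min(math.floor(P / width), k)
    if k_star < 1:
        raise ValueError("sketch assumes P >= L/(2m), so that one job fits")
    y = k / k_star
    fy = math.floor(y)
    return (L / y) * (fy * (fy + 1) / 2 + (y - fy) * (fy + 1))


def psi(P, m, k, L):
    """Ratio of the Breadth-first to the Depth-first average system time."""
    return s_breadth(P, m, k, L) / s_depth(P, m, k, L)
```

For example, with m = 1, ψ drops steadily from P = 5 to P = 9. It then jumps up at P = 10, where a second job fits side by side, which is the saw-tooth effect described above. For P ≥ kL/(2m) = 25, ψ decreases smoothly to 1 at P = 50.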


The next theorem gives the asymptotic behavior of ψ as P and k become large.

Theorem 4.3

As k and P become large, ψ < 2.


Proof:

Case I: $P/(L/2m) < k$.

Let $y = \frac{k}{P/(L/2m)} = \frac{kL}{2Pm}$. For y ≥ 1, $\frac{\lfloor y \rfloor(\lfloor y \rfloor+1)}{2} + (y - \lfloor y \rfloor)(\lfloor y \rfloor + 1) \ge \frac{y(y+1)}{2}$, so that

$$S_d \ge \frac{L(y+1)}{2} = \frac{kL^2}{4Pm} + \frac{L}{2}.$$

Hence

$$\psi - 2 = \frac{S_b - 2S_d}{S_d} \le \frac{\dfrac{kL^2}{2Pm} + \dfrac{Pm}{2k} - \dfrac{kL^2}{2Pm} - L}{S_d} = \frac{\dfrac{Pm}{2k} - L}{S_d}.$$

By assumption, $P/(L/2m) < k$, or $Pm/(2k) < L/4 < L$; therefore, $Pm/(2k) - L < 0$. Thus, ψ < 2.

Case II: $P/(L/2m) \ge k$.

For this case, we know $S_d = L$. Therefore,

$$\psi = \frac{kL}{2Pm} + \frac{Pm}{2kL}.$$

If we let $z = 2Pm/(kL)$, then

$$\psi = \frac{1}{z} + \frac{z}{4}. \qquad (4.11)$$

Plotting ψ versus z, we obtain Figure 4.14. To find the z at which ψ is at its minimum, we take the derivative of ψ with respect to z (approximating z by a continuous variable) and set the result to zero:

$$\frac{d\psi}{dz} = \frac{1}{4} - \frac{1}{z^2} = 0.$$

The minimum occurs at z = 2, where ψ = 1. Since z = 2 is an integer, this is also the minimum of ψ when z is restricted to integers. For integer z, ψ < 2 as long as z < 8, or $P < 4kL/m$. In other words, ψ < 2 as long as the number of processors is fewer than four times the number of jobs multiplied by the widest part of the process graph (L/m). Usually, we use at most k(L/m) processors. So, if we let $P \le kL/m$, then $z = 2Pm/(kL) \le 2$. Also, since we have assumed $P \ge kL/(2m)$, $z \ge 1$. Again, we see from Figure 4.14 that between z = 1 and 2, ψ ≤ 1.25.

Figure 4.14  ψ versus z


From Case II of the proof of Theorem 4.3, we know that, for $P \ge kL/(2m)$,

$$\psi = \frac{Pm}{2kL} + \frac{kL}{2Pm}.$$

Therefore, in Figures 4.12 and 4.13, the values of ψ for $P \ge kL/(2m)$ are given by the same expression as Equation (4.11). If we extend this curve backward to smaller values of P until ψ = 2, we obtain an easier bound on ψ. This is shown in Figure 4.15. The value P' at which the line ψ = 2 intercepts this curve can be calculated from the following expression:

$$\frac{P'm}{2kL} + \frac{kL}{2P'm} = 2, \qquad \text{or} \qquad (P'm)^2 - 4kL\,(P'm) + (kL)^2 = 0.$$

Solving for P' (taking the smaller root), we get

$$P' = (2 - \sqrt{3})\frac{kL}{m} \approx 0.268\,\frac{kL}{m}.$$

Therefore, for $1 \le P \le 0.268\,kL/m$, the upper bound on ψ is 2, and for $0.268\,kL/m < P \le kL/m$, the upper bound on ψ is the expression $\frac{Pm}{2kL} + \frac{kL}{2Pm}$. This fact is given in the next Corollary.
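The crossover point P' follows from a quadratic equation and can be confirmed numerically. The sketch below takes the smaller root, $(2 - \sqrt{3})kL/m$, and checks that the Case II expression equals 2 there and equals its minimum of 1 at P = kL/m (z = 2). The function names are hypothetical.

```python
import math


def psi_case_ii(P, m, k, L):
    """Case II expression psi = kL/(2Pm) + Pm/(2kL), Equation (4.11) in terms of P."""
    return k * L / (2 * P * m) + P * m / (2 * k * L)


def crossover_processors(m, k, L):
    """Smaller root of (P m)^2 - 4 kL (P m) + (kL)^2 = 0."""
    return (2 - math.sqrt(3)) * k * L / m
```

For L = 10, k = 5 and m = 1, `crossover_processors(1, 5, 10)` is about 13.4 processors, i.e., 0.268 kL/m.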


Thus, we see that the ratio of the average system time for the worst assignment to that for the best assignment is given by Corollary 4.4. This ratio is quite small. Hence, if we do allow random scheduling of the tasks to the processors, the resulting average system time will be bounded relatively tightly by $S_d$ and $S_b$.


4.6  Discussion

In this chapter, we discussed two methods for obtaining the average system time and the concurrency measure of a fixed process graph with randomly distributed task service times. These results apply both to a fixed number of jobs at time zero and to jobs arriving from a random source, because the number of processors is assumed to be infinite. In the process of finding the average system time, however, either an enumeration algorithm must be used or a large system of equations must be solved. Both methods are time consuming when the number of tasks in the process graph becomes large.

We can, however, use the upper and lower bounds on the average system time as a rule of thumb in approximating the concurrency measure. A relatively easy way to calculate both bounds has been presented.


A Stochastic Petri Net model was used to find the average utilization of processors for the case of a fixed number of jobs, a fixed process graph, random task service times and a limited number of processors.

Two scheduling algorithms were analyzed to find the ratio of the worst algorithm to the best algorithm in terms of the average system time. We found this ratio to be less than two for diamond-shaped process graphs.


CHAPTER 5
Random Process Graphs

5.1  Introduction 

In Chapter 4 we studied cases where the process graph was considered to be fixed. Therefore, the analysis and the system parameters obtained are valid for only one particular process graph. When we change to a different process graph, or try to predict the general system behavior of other process graphs, the results from a single process graph are often not very helpful.

In Sections 5.3 and 5.4, we obtain bounds on the average system time of a random process graph with N tasks and enough processors that a processor is always available whenever a task demands one. With these bounds on the average system time, we can find bounds on the speedup achievable when we use multiprocessors to process a fixed number of jobs. The speedup is defined as the inverse of the concurrency measure. The concurrency measure is still defined as the average system time using P processors divided by the average system time using only one processor. In Sections 5.5 and 5.6, we will assume that the number of processors is limited.

In our first model in Section 5.4, we will assume that the number of precedences is arbitrary; in our second model, this parameter is fixed to a constant. In the former case, only the arrangement of the tasks in the process graph is studied.

In this chapter (except in Section 5.4.2), we assume that the N tasks do not include the initial and the terminating tasks. Likewise, the number of precedence relationships, M (if given), does not include the precedence relationships from the initial task to the next-level tasks or from any task into the terminating task. The resulting upper and lower bounds are known to be two average task service times smaller than the actual bounds. This change does not affect any of the results, but it allows a clearer explanation without having to account for the two additional tasks at the boundary.


In Section 5.2.1 we show that the number of arrangements of the tasks for a process graph with a fixed number of tasks, N, is $2^{N-1}$, and that the number of arrangements for a particular number of levels, r, is $\binom{N-1}{r-1}$. From the construction algorithm described in Section 5.2.2, we show that the number of arrangements for an N-task, r-level process graph forms a Pascal tree. As N becomes large, the number of arrangements is Gaussian distributed with respect to the number of levels in a process graph (this is proved in Section 5.2.3). Hence, most of the arrangements will have a number of levels which (percentage-wise) is close to the average level, namely (N+1)/2.
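These counts can be checked by brute force for small N. The sketch below treats an arrangement as the sequence of level sizes, i.e., a composition of N into r positive parts; this reading of "arrangement" is the sketch's assumption. It compares a direct enumeration with $\binom{N-1}{r-1}$. Summed over r, the counts give $2^{N-1}$, and the mean number of levels is (N+1)/2. The function names are hypothetical.

```python
from itertools import product
from math import comb


def arrangements_by_levels(n_tasks: int) -> dict[int, int]:
    """Level-size sequences of n_tasks tasks on r nonempty levels: C(N-1, r-1)."""
    return {r: comb(n_tasks - 1, r - 1) for r in range(1, n_tasks + 1)}


def brute_force_counts(n_tasks: int) -> dict[int, int]:
    """Enumerate every composition of N and tally it by its number of levels r.

    A composition is fixed by choosing, for each of the N-1 gaps between
    consecutive tasks, whether a new level starts there.
    """
    counts: dict[int, int] = {}
    for cuts in product((0, 1), repeat=n_tasks - 1):
        r = 1 + sum(cuts)
        counts[r] = counts.get(r, 0) + 1
    return counts
```

For N = 5, both functions give the counts 1, 4, 6, 4, 1 for r = 1, ..., 5. These are a row of Pascal's triangle with mean level 3 = (N+1)/2.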


The Chernoff bound, introduced in Section 5.2.4, will be used for the probabilistic argument in Sections 5.3 and 5.4.1, in which we obtain upper and lower bounds on the average system time for a randomly selected process graph.


With the number of precedence relationships (edges) and the number of levels added as additional parameters, we find tighter bounds in Section 5.4.2. The two upper bounds obtained are compared in Section 5.4.3.


In Section 5.5, the trade-off between the utilization of processors and the average system time is discussed. Finally, in Section 5.6, we briefly look at the bounds when the number of processors is limited to a finite number.


5.2 Some Properties of Random Process Graphs
5.2.1 Total Number of Arrangements with N Tasks

Process graphs with N tasks can have $r = 1, 2, 3, \ldots, N$ levels. The only constraint on the arrangement of the tasks is that each of the r levels must contain at least one task. Therefore, after first placing one task in each level, we can replace the question 'How many ways can we distribute N tasks in r levels with each level containing at least one task?' by the following simpler question: 'How many ways can we distribute the remaining (N − r) tasks in r levels?'

This is a combinatorics problem. We know that the number of combinations of z distinct objects taken y at a time with repetition allowed is

$$\binom{z+y-1}{y}.$$

In terms of the number of tasks and levels, we wish to find the number of combinations of r levels taken (N − r) at a time. An intuitive way of looking at this is to observe that we are selecting a particular level for each of the (N − r) tasks. Therefore, letting z = r and y = N − r, we have

$$\binom{N-1}{N-r} = \binom{N-1}{r-1}$$

as the number of arrangements.

When summing the number of arrangements over all levels, we get the total number of arrangements for the N-task process graphs:

$$\sum_{r=1}^{N}\binom{N-1}{r-1} = 2^{N-1}.$$
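These counts can be sanity-checked by brute force for small N. The sketch below uses a different encoding from the combinations-with-repetition argument (an assumption made only for illustration): an arrangement is determined by which of the N − 1 gaps between consecutive tasks start a new level. The function name `arrangements_by_level` is hypothetical.

```python
from itertools import product
from math import comb

def arrangements_by_level(n_tasks):
    """Brute-force count of the arrangements of n_tasks tasks, grouped by the
    number of levels r. Each of the n_tasks - 1 gaps between consecutive tasks
    either starts a new level (1) or does not (0)."""
    counts = {}
    for cuts in product((0, 1), repeat=n_tasks - 1):
        r = 1 + sum(cuts)  # number of levels = number of cuts + 1
        counts[r] = counts.get(r, 0) + 1
    return counts

for n in range(1, 9):
    counts = arrangements_by_level(n)
    # Per-level count matches C(N-1, r-1); the total matches 2^(N-1).
    assert all(counts[r] == comb(n - 1, r - 1) for r in counts)
    assert sum(counts.values()) == 2 ** (n - 1)

print(arrangements_by_level(6))  # {1: 1, 2: 5, 3: 10, 4: 10, 5: 5, 6: 1}
```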


5.2.2  A  Method  of  Constructing  All  Arrangements  of  Process  Graphs  with  N  Tasks 

We first present a recursive algorithm that constructs the arrangements of process graphs, and then prove that this algorithm generates all arrangements for the process graphs of N tasks. From this construction method, we will see that the numbers of arrangements for N-task, r-level process graphs form a Pascal tree (from which we can also obtain the total number of arrangements for a process graph with N tasks).

The  construction  method  is  shown  in  Algorithm  C: 


Algorithm  C 

1. For N = 1, there is just one arrangement.

2. For N ≥ 2 tasks and r levels, we add one task at a new level, the r-th level, to each of the arrangements (possibly none) with (N − 1) tasks and (r − 1) levels; we also add one task to the r-th level of each of the arrangements (possibly none) with (N − 1) tasks and r levels.

3. Repeat Step 2 for each level r = 1, 2, ..., N to obtain all the arrangements for the N-task process graph.


Note that, in order to construct the arrangements for the N-task process graphs, we must also construct all the i-task process graphs where i < N.
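A minimal Python sketch of Algorithm C, assuming an arrangement is represented as a tuple of level widths $(n_1, \ldots, n_r)$; the function names are hypothetical. Memoization stands in for the requirement that all i-task arrangements with i < N are built first.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def algorithm_c(n_tasks, levels):
    """Arrangements with n_tasks tasks and exactly `levels` levels, built as in
    Algorithm C. An arrangement is a tuple (n_1, ..., n_r) of level widths."""
    if levels < 1 or levels > n_tasks:
        return ()  # no arrangement exists ("possibly none" in Step 2)
    if n_tasks == 1:
        return ((1,),)  # Step 1: a single task forms a single level
    # Step 2, first part: put one task on a new r-th level of each
    # (N-1)-task, (r-1)-level arrangement.
    new_level = tuple(a + (1,) for a in algorithm_c(n_tasks - 1, levels - 1))
    # Step 2, second part: add one task to the r-th (last) level of each
    # (N-1)-task, r-level arrangement.
    grow_last = tuple(a[:-1] + (a[-1] + 1,) for a in algorithm_c(n_tasks - 1, levels))
    return new_level + grow_last

def all_arrangements(n_tasks):
    """Step 3: collect the Step 2 results for every number of levels r."""
    return [a for r in range(1, n_tasks + 1) for a in algorithm_c(n_tasks, r)]

arrs = all_arrangements(6)
assert len(arrs) == 2 ** 5 and len(set(arrs)) == len(arrs)  # all 32 are distinct
assert len(algorithm_c(6, 3)) == comb(5, 2) == 10           # the 10 of Figure 5.1b
print(algorithm_c(2, 1) + algorithm_c(2, 2))                # ((2,), (1, 1))
```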
Figure 5.1 shows examples of constructing

a. all 2 arrangements of the 2-task process graphs from the one 1-task process graph, and

b. all 10 arrangements of the 6-task, 3-level process graphs from the 4 arrangements of the 5-task, 2-level process graphs and the 6 arrangements of the 5-task, 3-level process graphs.

From Figure 5.1, we observe an interesting property of the arrangements for process graphs. That is, for any arrangement R, there exists another arrangement R′ which is symmetric to R: if in arrangement R we let $n_i$ represent the number of tasks in level i, then in arrangement R′ the number of tasks in the i-th level, $n_i'$, is

$$n_i' = n_{L-i+1}, \qquad i = 1, 2, \ldots, L,$$

where L is the total number of levels in R and R′.
Figure 5.1a Arrangements of Process Graphs with 2 Tasks

Figure 5.1b Arrangements of Process Graphs with 6 Tasks and 3 Levels

From Section 5.2.1 above, we know that, with N tasks, there are $\binom{N-1}{r-1}$ arrangements with r levels. The above algorithm constructs

$$\binom{N-2}{r-2} + \binom{N-2}{r-1} = \binom{N-1}{r-1}$$

arrangements. Hence, if we can show that all of the $\binom{N-1}{r-1}$ arrangements are unique, then we shall have obtained all the arrangements with N tasks and r levels.


Lemma 5.1

All arrangements created for process graphs with N tasks and L levels using Algorithm C are unique.

Proof

Assume that all arrangements with (N − 1) tasks and r = 1, 2, ..., (N − 1) levels are unique. To construct the arrangements with N tasks and L levels, where 1 ≤ L ≤ N, we add a task at the new L-th level for all arrangements with (N − 1) tasks and (L − 1) levels, and we add a task to the bottom level of all arrangements with (N − 1) tasks and L levels. The arrangements from the former case will have only one task in the L-th level, while the arrangements from the latter case will have more than one task in the L-th level. Hence, between these two cases, no two arrangements can be identical.

We know from the assumption that all the arrangements with (N − 1) tasks are unique; therefore, the resulting arrangements after adding a new level, the L-th level, with one task in it, are still unique within the former case, and the resulting arrangements after adding a new task to the L-th level are also unique within the latter case. Thus, by induction on N, all arrangements obtained by our algorithm are unique.

∎


Theorem 5.2

All arrangements created by Algorithm C for N-task process graphs are unique.

Proof.

From Lemma 5.1 we know that the arrangements are unique within each number of levels r. Since arrangements with different numbers of levels cannot be identical to each other (this is due to the constraint that each level must have at least one task), no two of the arrangements created by Algorithm C are identical.

∎

From this construction method, we see that the numbers of arrangements for N tasks and r levels actually form a Pascal tree: the count for (N, r) is the sum of the counts for (N − 1, r − 1) and (N − 1, r). In the next section, we show that, as N becomes large, the distribution of the number of arrangements with respect to the number of levels is Gaussian.
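The Pascal-tree structure follows from Step 2 of Algorithm C: the count A(N, r) of N-task, r-level arrangements is A(N − 1, r − 1) + A(N − 1, r). The sketch below is a hypothetical helper that only counts arrangements and does not build them. It tabulates this recurrence and compares it with the closed form of Section 5.2.1.

```python
from math import comb

def pascal_tree(max_tasks):
    """Row N-1 of the result lists A(N, r) for r = 1..N, where A(N, r) is the
    number of N-task arrangements with r levels, using the Algorithm C
    recurrence A(N, r) = A(N-1, r-1) + A(N-1, r) with A(1, 1) = 1."""
    rows = [[1]]
    for n in range(2, max_tasks + 1):
        prev = [0] + rows[-1] + [0]  # A(n-1, 0) = A(n-1, n) = 0
        rows.append([prev[r - 1] + prev[r] for r in range(1, n + 1)])
    return rows

for n, row in enumerate(pascal_tree(8), start=1):
    assert row == [comb(n - 1, r - 1) for r in range(1, n + 1)]
    print(n, row)
```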


5.2.3 Distribution of the Number of Arrangements

The number of arrangements for an N-task process graph with respect to each level is a binomial coefficient. We analyze the placement of the tasks with a random variable $Y_i$ for task i: if $Y_i = 1$, the new task is added to a new level; if $Y_i = 0$, it is added to an existing level. In a Pascal tree, this is equivalent to going either to the left (i.e., the number of levels remains the same) or to the right (i.e., the number of levels increases by one) of the current location in the next level of the Pascal tree.


We define a random variable $Y = \sum_{i=1}^{N} Y_i$, where

$$Y_i = \begin{cases} 1 & \text{with probability } q \\ 0 & \text{with probability } p = 1 - q \end{cases}$$

We let $p = q = \tfrac{1}{2}$. When we sum N such random variables, we obtain the binomial distribution

$$P[Y = k] = \binom{N}{k}\left(\tfrac{1}{2}\right)^{N},$$

which has a mean of $Nq = N/2$ and a variance of $Npq = N/4$ when $p = q = \tfrac{1}{2}$. In terms of our parameters, when $Y_i = 1$, we add the new node i to a new level of the arrangement; when $Y_i = 0$, we add the new node i to one of the existing levels. Summing N of the random variables $Y_i$, we have an arrangement with Y levels. For large N, this distribution is essentially the same as the distribution of the number of arrangements with respect to the number of levels in a process graph with N tasks. If we normalize this distribution so that the mean is zero and the variance is one, i.e., consider $Z = (Y - Nq)/\sqrt{Npq}$, then the characteristic function¹ is given by [MISE64]

$$\phi(\omega) = \left[p\,e^{-i\omega q/\sqrt{Npq}} + q\,e^{i\omega p/\sqrt{Npq}}\right]^{N} = \left[\tfrac{1}{2}e^{-i\omega/\sqrt{N}} + \tfrac{1}{2}e^{i\omega/\sqrt{N}}\right]^{N},$$

where $i = \sqrt{-1}$.

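Since $e^{ix} = 1 + ix - \dfrac{x^2}{2} + o(x^2)$,

$$\phi(\omega) = \left[\frac{1}{2}\left(1 + \frac{i\omega}{\sqrt{N}} - \frac{\omega^2}{2N} + o\!\left(\frac{\omega^2}{N}\right)\right) + \frac{1}{2}\left(1 - \frac{i\omega}{\sqrt{N}} - \frac{\omega^2}{2N} + o\!\left(\frac{\omega^2}{N}\right)\right)\right]^{N} = \left[1 - \frac{\omega^2}{2N} + o\!\left(\frac{\omega^2}{N}\right)\right]^{N},$$

where o(z) denotes any function which goes to zero faster than z, that is, $\lim_{z \to 0} o(z)/z = 0$. Taking the limit as N approaches infinity, we obtain

$$\lim_{N\to\infty}\phi(\omega) = e^{-\omega^2/2}.$$

But $e^{-\omega^2/2}$ is the characteristic function of a normalized (i.e., mean = 0 and variance = 1) Gaussian distribution. Thus, we have shown that the number of arrangements with respect to the number of levels is Gaussian distributed.

¹ Let X be a random variable with probability distribution function F(x). The characteristic function of F(x) is the function φ defined for real ω by

$$\phi(\omega) = \int_{-\infty}^{\infty} e^{i\omega x}\,dF(x) = u(\omega) + i\,v(\omega),$$

where $i = \sqrt{-1}$,

$$u(\omega) = \int_{-\infty}^{\infty}\cos\omega x\,dF(x) \qquad\text{and}\qquad v(\omega) = \int_{-\infty}^{\infty}\sin\omega x\,dF(x).$$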
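The convergence can also be checked numerically. The sketch below evaluates the characteristic function of the normalized sum $Z = (Y - Nq)/\sqrt{Npq}$ directly from its definition and compares it with $e^{-\omega^2/2}$. The function name and the choice ω = 1.5 are arbitrary illustrations.

```python
import cmath
import math

def normalized_cf(omega, n_tasks, q=0.5):
    """Characteristic function of Z = (Y - N q) / sqrt(N p q), where Y is the sum
    of N independent Y_i with P[Y_i = 1] = q and P[Y_i = 0] = p = 1 - q."""
    p = 1.0 - q
    scale = math.sqrt(n_tasks * p * q)
    # Each centred term Y_i - q equals p (prob. q) or -q (prob. p).
    term = q * cmath.exp(1j * omega * p / scale) + p * cmath.exp(-1j * omega * q / scale)
    return term ** n_tasks

omega = 1.5
gaussian = math.exp(-omega ** 2 / 2)  # characteristic function of N(0, 1)
for n in (10, 100, 1000, 10000):
    print(n, abs(normalized_cf(omega, n) - gaussian))  # error shrinks as N grows
```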


5.2.4 Chernoff Bound on the Tail Probability

Suppose we want to know the probability that a randomly selected arrangement has more than y levels. The Chernoff bound [KLEI75] gives us a very good bound on this tail probability.


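First, from [KLEI75] we find the moment generating function for the sum of N Bernoulli trials. For N = 1,

$$M(v) = \tfrac{1}{2} + \tfrac{1}{2}e^{v},$$

which indicates that, with probability 1/2, we add a new level ($e^{v}$), and with probability 1/2, we don't add a new level ($e^{0}$). For the sum of N independent trials, the moment generating function is $M(v)^N$. We now define the semi-invariant generating function

$$\gamma(v) = \ln M(v) = \ln\left[\tfrac{1}{2} + \tfrac{1}{2}e^{v}\right].$$

The Chernoff bound for the tail of our density function is given by [KLEI75] as

$$\mathrm{Prob}[Y \ge y] \le e^{-vy + N\gamma(v)}, \qquad v > 0.$$

Since this inequality holds for any value of v > 0, we should choose v, as in [KLEI75], to create the tightest possible bound. This is done by differentiating the exponent with respect to v and setting the derivative to zero. We then find the optimum relationship between v and y as

$$y = N\gamma'(v).$$

Hence, we have

$$\mathrm{Prob}[Y \ge N\gamma'(v)] \le e^{-N[v\gamma'(v) - \gamma(v)]}.$$

We let $y = \left(\tfrac{1}{2} + \epsilon\right)N$ and $0 < \epsilon < \tfrac{1}{2}$. In order to find the optimum relationship between v and y, we let

$$\left(\tfrac{1}{2} + \epsilon\right)N = N\gamma'(v) = N\,\frac{\tfrac{1}{2}e^{v}}{\tfrac{1}{2} + \tfrac{1}{2}e^{v}} = N\,\frac{e^{v}}{1 + e^{v}}.$$

Solving for v, we get

$$v = \ln\frac{y}{N - y} = \ln\frac{\tfrac{1}{2} + \epsilon}{\tfrac{1}{2} - \epsilon}.$$

Thus,

$$\mathrm{Prob}[Y \ge y] = \mathrm{Prob}[Y \ge N\gamma'(v)] \le e^{-N[v\gamma'(v) - \gamma(v)]}.$$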


Figure 5.2a shows several curves of P versus ε for various values of N, where P is the upper bound on the tail probabilities such that $\mathrm{Prob}[Y \ge (\tfrac{1}{2}+\epsilon)N] \le P$. We note that, as N increases, the probability is concentrated closer to the mean N/2 and the bound on the tail probability falls faster. Figure 5.2b shows the probability of a randomly selected process graph having more than y levels.
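A minimal sketch of the optimized Chernoff bound of this section, assuming Y is binomial with N trials and success probability 1/2. It compares the bound with the exact binomial tail. The helper names are hypothetical.

```python
from math import comb, exp, log

def chernoff_tail_bound(n_tasks, eps):
    """Chernoff bound on Prob[Y >= (1/2 + eps) N] for Y ~ Binomial(N, 1/2),
    evaluated at the optimal v = ln((1/2 + eps) / (1/2 - eps))."""
    if not 0 < eps < 0.5:
        raise ValueError("eps must satisfy 0 < eps < 1/2")
    v = log((0.5 + eps) / (0.5 - eps))
    gamma = log(0.5 + 0.5 * exp(v))  # semi-invariant generating function of one trial
    y = (0.5 + eps) * n_tasks
    return exp(-v * y + n_tasks * gamma)

def exact_tail(n_tasks, y):
    """Exact Prob[Y >= y] for Y ~ Binomial(N, 1/2)."""
    return sum(comb(n_tasks, k) for k in range(n_tasks + 1) if k >= y) / 2 ** n_tasks

for n in (20, 100, 500):
    y = (0.5 + 0.1) * n
    print(n, exact_tail(n, y), chernoff_tail_bound(n, 0.1))
```

The exact tail always stays below the bound, and both fall off roughly exponentially in N for fixed ε.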


5.2.5 Generation of Random Process Graphs

Random process graphs can be produced by at least the following four methods. In the first method, we are given the total number of levels for all graphs and a probability distribution of the width of each level. After a random number of tasks is selected for each level, precedence relationships are created by randomly connecting the nodes of adjacent levels, with all edges directed toward the terminal node.


Figure 5.2b Prob[Y > y]


For the second method, we initially produce a connected undirected random graph with a given number of tasks and edges. We next select a given node as the initial task and direct the edges from it to the nodes one hop away. This is followed by directing the edges from nodes that are one hop away to nodes that are two hops away. This procedure is repeated until the edges of the node farthest away from the initial node have been assigned a direction. Any remaining undirected edges can have either direction. Then, in order to conform with a normal process graph, we add an additional terminal node. An edge is added to this terminal node from every node with an out-degree of zero.
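The second method can be sketched as follows. Two assumptions are made for this illustration. First, the connected undirected graph is generated as a random spanning tree plus extra random edges, because the text does not specify a generator. Second, edges joining two nodes at the same hop distance are directed from the lower to the higher node id. This is one admissible choice of "either direction", and it also rules out cycles. All function names are hypothetical.

```python
import random
from collections import deque

def random_connected_graph(n_nodes, n_edges, rng):
    """Hypothetical generator: random spanning tree plus distinct extra edges."""
    if not n_nodes - 1 <= n_edges <= n_nodes * (n_nodes - 1) // 2:
        raise ValueError("edge count out of range for a connected simple graph")
    order = list(range(n_nodes))
    rng.shuffle(order)
    # Attach each node to a random earlier node: a spanning tree with n-1 edges.
    edges = {frozenset((order[i], rng.choice(order[:i]))) for i in range(1, n_nodes)}
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_nodes), 2)
        edges.add(frozenset((a, b)))
    return [tuple(sorted(e)) for e in edges]

def orient_by_hops(n_nodes, edges, initial=0):
    """Direct edges away from `initial` by hop distance and add a terminal node."""
    adj = {v: [] for v in range(n_nodes)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    hops, queue = {initial: 0}, deque([initial])
    while queue:  # breadth-first search: hop distance from the initial node
        u = queue.popleft()
        for w in adj[u]:
            if w not in hops:
                hops[w] = hops[u] + 1
                queue.append(w)
    # Fewer hops first; ties (same hop count) broken by node id.
    directed = [(a, b) if (hops[a], a) < (hops[b], b) else (b, a) for a, b in edges]
    terminal = n_nodes
    has_out = {a for a, _ in directed}
    directed += [(v, terminal) for v in range(n_nodes) if v not in has_out]
    return directed
```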


If the number of precedence relationships is large, the resulting process graph will generally have a large number of levels. This is due to the constraint that no precedence relationships are allowed between any two tasks on the same level: if such an edge does exist, one of the nodes is pushed down to the following level. In fact, if the number of edges equals N(N − 1)/2, only one process graph can be generated: a linear chain of N tasks.

Drawing from the theory of branching processes [HARR63], the third method uses the Galton-Watson branching process to create a random graph. In this process, level one has one task. Then each task at level i has the probability $p_k$ of creating k new tasks at the (i + 1)-th level, where k = 0, 1, 2, .... An edge connects each of the new tasks with the creating task. If a task creates no new task, it is extinguished, and it has a precedence relationship to any task on the next level. A difficulty with this method is that the process graph may have an infinite number of levels. If a finite-level process graph is found, it is, by our definition, a structured process graph.
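A sketch of the branching-process generator follows. It assumes the offspring distribution $\{p_k\}$ is given as a finite list. It reads "a precedence relationship to any task on the next level" as an edge from an extinguished task to one randomly chosen task on the next level, if that level exists. Because the process may never die out, the sketch gives up after a hypothetical cap of `max_levels` levels.

```python
import random

def galton_watson_graph(offspring_probs, rng, max_levels=50):
    """offspring_probs[k] is p_k, the probability that a task creates k tasks on
    the next level. Returns (levels, edges), or None if the process is still
    alive after max_levels levels (treated as the infinite-level case)."""
    ks = list(range(len(offspring_probs)))
    levels, edges, next_id = [[0]], [], 1  # level one holds a single task, id 0
    while len(levels) <= max_levels:
        new_level, extinguished = [], []
        for task in levels[-1]:
            k = rng.choices(ks, weights=offspring_probs)[0]
            if k == 0:
                extinguished.append(task)
            for _ in range(k):
                edges.append((task, next_id))  # edge from creator to each new task
                new_level.append(next_id)
                next_id += 1
        if not new_level:
            return levels, edges  # every task extinguished: a finite graph
        for task in extinguished:
            edges.append((task, rng.choice(new_level)))  # assumed reading, see text
        levels.append(new_level)
    return None

result = galton_watson_graph([0.5, 0.3, 0.2], random.Random(5))
print("cap reached" if result is None else [len(level) for level in result[0]])
```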


Finally, Dodin [DODI81] proposed another method of creating random process graphs with N nodes and M edges. In his method, the adjacency matrix that represents the precedence relationships is created in one of the following two ways:

1. Deletion Method

Create an adjacency matrix with the upper triangle full of 1's. Randomly delete N(N − 1)/2 − M edges on the condition that there exist at least one edge into and one edge out of any node.

2. Addition Method

Distribute one edge to nodes (1, 2) and one edge to nodes (N − 1, N), and randomly distribute the remaining M − 2 edges to the upper triangle of the adjacency matrix on the condition that there exist at least one edge into and one edge out of any node.

From this adjacency matrix, a process graph is generated by mapping all the edges in the matrix onto a set of nodes enumerated from 1 to N. Because only the upper triangle of the adjacency matrix can have 1's, every edge goes from a lower-numbered node to a higher-numbered node, so the resulting directed graph is also guaranteed to be acyclic.
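A sketch of Dodin's addition method follows, under stated assumptions. Nodes are numbered 1 to N. The condition "at least one edge into and one edge out of any node" is read as applying to every node, except that node 1 needs no incoming edge and node N needs no outgoing edge. The condition is enforced by simple rejection sampling, which is our choice and not necessarily Dodin's. The function names are hypothetical.

```python
import random

def dodin_addition(n_nodes, n_edges, rng, max_tries=10_000):
    """Return M edges (i, j) with i < j, i.e. 1's in the upper triangle of the
    adjacency matrix, so the resulting graph is acyclic by construction."""
    upper = [(i, j) for i in range(1, n_nodes + 1) for j in range(i + 1, n_nodes + 1)]
    fixed = {(1, 2), (n_nodes - 1, n_nodes)}  # the two edges placed first
    if not len(fixed) <= n_edges <= len(upper):
        raise ValueError("need len(fixed) <= M <= N(N-1)/2")
    free = [e for e in upper if e not in fixed]
    for _ in range(max_tries):
        edges = fixed | set(rng.sample(free, n_edges - len(fixed)))
        has_in = {j for _, j in edges}
        has_out = {i for i, _ in edges}
        if all(v in has_in for v in range(2, n_nodes + 1)) and all(
            v in has_out for v in range(1, n_nodes)
        ):
            return sorted(edges)
    raise RuntimeError("no admissible matrix found; increase M or max_tries")

def adjacency_matrix(n_nodes, edges):
    """0/1 matrix with row i, column j (1-based nodes) set for each edge (i, j)."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        matrix[i - 1][j - 1] = 1
    return matrix

print(adjacency_matrix(6, dodin_addition(6, 8, random.Random(3))))
```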
Hence, as the number of tasks per job becomes large, the concurrency measure takes on the value of one half.


5.4 Random Task Service Times (k, G*, x*, P = ∞)

5.4.1 Bounds on the Average System Time without the Number of Precedence Relationships

In this section, an upper bound and a lower bound on the average system time are found for the random process graph with N tasks and exponentially distributed task service times.


5.4.1.1 Upper Bound

We wish to find an upper bound on the average system time of an N-task process graph. From the Chernoff bound we know the probability of a randomly chosen process graph having more than y levels in its arrangement. Thus, for a specific y, if we can find an upper bound for process graphs with at most y levels, then this bound holds for a randomly chosen process graph with probability at least 1 − Prob[Y > y]. In this section we obtain an upper bound for process graphs with the number of levels equal to or less than y. As N → ∞, we can let y be arbitrarily close to (but greater than) the mean number of levels, m, and the probability that the average system time of a randomly selected process graph will be smaller than this upper bound will approach one.

The following two lemmas provide an upper bound on the average system time for arrangements with less than or equal to y levels, where m < y < N.


Lemma 5.3

Given N tasks and y levels for a process graph, if we assign N/y tasks to each level, then the resulting forced synchronization time (FST) is the maximum average system time with respect to the other arrangements of the tasks with y levels.

Forced synchronization time is defined to be the time required to process a process graph such that each task in a given level is blocked by all the tasks in the previous level. In other words, we are forcing the tasks to be executed one level at a time (with all tasks in a given level waiting for the slowest task in the previous level to complete before they all start execution).


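Proof of Lemma 5.3

Since the tasks have been assumed to have exponential service times with a mean of 1/μ, the blocking time of a task with d tasks blocking it is (see Section 4.3.1)

$$\frac{1}{\mu}\sum_{k=1}^{d}\frac{1}{k}.$$

With $n_i$ tasks in each level i, i = 1, 2, ..., y, we have the upper bound on the total average system time S as

$$S_{UB} = \frac{1}{\mu}\sum_{k=1}^{n_1}\frac{1}{k} + \frac{1}{\mu}\sum_{k=1}^{n_2}\frac{1}{k} + \cdots + \frac{1}{\mu}\sum_{k=1}^{n_y}\frac{1}{k}$$

subject to the conditions

$$\sum_{i=1}^{y} n_i = N \qquad\text{and}\qquad n_j > 0 \;\; \forall j.$$

We must show that $n_j = N/y$, for each level j, maximizes $S_{UB}$. Assuming $n_j = N/y$ for all j gives

$$S_{UB} = \frac{y}{\mu}\sum_{k=1}^{N/y}\frac{1}{k}.$$

Suppose there exists another arrangement whose FST S′ is larger than $S_{UB}$. Let

$$n_1', \ldots, n_s', n_{s+1}', \ldots, n_t', n_{t+1}', \ldots, n_y'$$

be the arrangement, where

$$n_i' > \frac{N}{y} \;\text{ for } 1 \le i \le s, \qquad n_i' = \frac{N}{y} \;\text{ for } s+1 \le i \le t, \qquad n_i' < \frac{N}{y} \;\text{ for } t+1 \le i \le y;$$

then

$$S' = S_{UB} + \frac{1}{\mu}\sum_{i=1}^{s}\;\sum_{k=N/y+1}^{N/y+w_i}\frac{1}{k} \;-\; \frac{1}{\mu}\sum_{i=t+1}^{y}\;\sum_{k=N/y-u_i+1}^{N/y}\frac{1}{k},$$

where $w_i = n_i' - N/y \ge 1$ and $u_i = N/y - n_i' \ge 1$. Here $a = \sum_{i=1}^{s} w_i = \sum_{i=1}^{s} n_i' - sN/y$ is the total number of additional nodes added to levels 1 to s, and $b = \sum_{i=t+1}^{y} u_i = (y-t)N/y - \sum_{i=t+1}^{y} n_i'$ is the total number of nodes taken out of levels (t + 1) to y. Since the total number of tasks remains constant, for each additional task over N/y in a level i between 1 and s, one task must be taken out of another level j between t + 1 and y. Therefore, a = b. Now,

$$\sum_{k=N/y+1}^{N/y+w_i}\frac{1}{k} \le \frac{w_i}{N/y+1}, \qquad w_i \ge 1,$$

and

$$\sum_{k=N/y-u_i+1}^{N/y}\frac{1}{k} \ge \frac{u_i}{N/y}, \qquad 1 \le u_i < \frac{N}{y}.$$

Thus,

$$S' \le S_{UB} + \frac{1}{\mu}\left[\frac{a}{N/y+1} - \frac{b}{N/y}\right].$$

Since a = b and $\dfrac{1}{N/y+1} < \dfrac{1}{N/y}$, we have $S' < S_{UB}$, which contradicts the assumption. Hence, we have shown that the arrangement

$$n_j = \frac{N}{y}, \qquad j = 1, 2, \ldots, y,$$

gives us the maximum FST.

∎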

In the cases when N/y is not an integer in the above lemma, the arrangement that gives the maximum FST is constructed as follows. Let u = remainder of N/y and w = integer part of N/y; then $n_j = w + 1$ for j = 1, 2, ..., u and $n_j = w$ for j = u + 1, ..., y. From the proof of the above lemma, we know

$$\frac{u}{\mu}\sum_{k=1}^{w+1}\frac{1}{k} + \frac{y-u}{\mu}\sum_{k=1}^{w}\frac{1}{k} \le \frac{y}{\mu}\sum_{k=1}^{N/y}\frac{1}{k},$$

where the harmonic sum with the non-integer upper limit N/y is interpreted through its continuous approximation (as in Lemma 5.4). Therefore, the arrangement $n_j = N/y$ still gives us the maximum FST for any process graph having y levels, even though no process graph can actually have a fractional number of nodes in a level.
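Lemma 5.3, together with the remark on non-integer N/y, can be checked by brute force for small N. The sketch below assumes μ = 1 and uses exact rational arithmetic. The FST of an arrangement is taken as the sum over levels of $H_{n_i}/\mu$, the expected maximum of $n_i$ exponential task times. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def harmonic(n):
    """Exact harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))

def forced_sync_time(widths, mu=1):
    """FST of an arrangement with level widths n_1..n_y: each level waits for the
    slowest of its n_i exponential tasks, whose expected time is H_{n_i} / mu."""
    return sum((harmonic(n) for n in widths), Fraction(0)) / mu

def compositions(n_tasks, levels):
    """All arrangements of n_tasks tasks in exactly `levels` non-empty levels."""
    for cuts in combinations(range(1, n_tasks), levels - 1):
        bounds = (0,) + cuts + (n_tasks,)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))

def balanced(n_tasks, levels):
    """u levels of w + 1 tasks and y - u levels of w tasks, with w, u = divmod(N, y)."""
    w, u = divmod(n_tasks, levels)
    return (w + 1,) * u + (w,) * (levels - u)

for n in range(1, 11):
    for y in range(1, n + 1):
        best = max(forced_sync_time(a) for a in compositions(n, y))
        assert best == forced_sync_time(balanced(n, y))
```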


Lemma 5.4

The FST calculated in Lemma 5.3 for an N-task, y-level process graph is also the maximum FST for any process graph with fewer than y levels.

Proof

This lemma can be formulated as the following nonlinear optimization problem:

$$\max\; S = \frac{y_1}{\mu}\sum_{k=1}^{N/y_1}\frac{1}{k} \qquad \text{subject to } y_1 \le y.$$

We know that the harmonic series [KNUT73a], $\sum_{k=1}^{n} 1/k$, is approximated by

$$\sum_{k=1}^{n}\frac{1}{k} \approx \ln n + \gamma + \frac{1}{2n} - \frac{1}{12n^2} + \cdots,$$

where γ is Euler's constant (≈ 0.57721...). Therefore,

$$S \approx \frac{y_1}{\mu}\left[\ln\frac{N}{y_1} + \gamma + \frac{y_1}{2N} - \frac{y_1^2}{12N^2}\right].$$

Now, we assume $y_1$ is continuous. Then

$$\frac{dS}{dy_1} = \frac{1}{\mu}\left[\ln\frac{N}{y_1} + \gamma - 1 + \frac{y_1}{N} - \frac{y_1^2}{4N^2}\right].$$

For $1 \le y_1 \le N$, we find that $dS/dy_1 > 0$; that is, the slope of S versus $y_1$ is positive in the region $1 \le y_1 \le N$. (With $t = y_1/N \in (0, 1]$, the bracket equals $-\ln t + \gamma - 1 + t - t^2/4$, which is decreasing in t and equals $\gamma - \tfrac{1}{4} > 0$ at t = 1.) Since at $y_1 = 1$,

$$S = \frac{1}{\mu}\sum_{k=1}^{N}\frac{1}{k} > 0,$$

S is a positive, increasing function of $y_1$. The condition $y_1 \le y$ implies that the maximum S occurs at $y_1 = y$. Therefore, the maximum FST obtained for an N-task process graph with y levels is also the maximum FST for all N-task process graphs with fewer than y levels.

∎
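The monotonicity argument of Lemma 5.4 can be checked numerically using the same asymptotic approximation of the harmonic series. The sketch below is illustrative only: it assumes μ = 1, treats $y_1$ as continuous, and the helper names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def harmonic_approx(x):
    """H_x ~ ln x + gamma + 1/(2x) - 1/(12 x^2), applied to real x."""
    return math.log(x) + EULER_GAMMA + 1 / (2 * x) - 1 / (12 * x * x)

def fst_bound(n_tasks, y1, mu=1.0):
    """Continuous relaxation S(y1) = (y1 / mu) * H_{N / y1}."""
    return y1 / mu * harmonic_approx(n_tasks / y1)

def fst_slope(n_tasks, y1, mu=1.0):
    """dS/dy1 = (1/mu) [ln(N/y1) + gamma - 1 + y1/N - y1^2 / (4 N^2)]."""
    t = y1 / n_tasks
    return (-math.log(t) + EULER_GAMMA - 1 + t - t * t / 4) / mu

n = 50
grid = [1 + 49 * k / 100 for k in range(101)]  # y1 from 1 to N
assert all(fst_slope(n, y) > 0 for y in grid)
values = [fst_bound(n, y) for y in grid]
assert values == sorted(values)  # S increases with y1, so the maximum is at y1 = y
```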


The next theorem follows as the result of the two lemmas above.

Theorem 5.5

An upper bound for the average system time of an N-task process graph with y levels or less and exponential task times with mean 1/μ is

$$S_{UB}(y) = \frac{y}{\mu}\sum_{k=1}^{N/y}\frac{1}{k}.$$

This upper bound is valid only for process graphs with less than or equal to y levels. But since the number of arrangements is Gaussian distributed with respect to the number of levels, we expect that for y > N/2, as N → ∞, the probability that a randomly chosen process graph has a higher average system time than $S_{UB}(y)$ gets smaller. In fact, we show in the next theorem that as N → ∞ and $y = (\tfrac{1}{2} + \delta)N$ for any small positive δ, the tail probability Prob[Y > y], or the probability that a process graph has more than y levels, approaches zero.


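Theorem 5.6

For $y = (\tfrac{1}{2} + \delta)N$, where δ is a real number with $0 < \delta < \tfrac{1}{2}$, and Y the number of levels in a randomly selected process graph, we have

$$\lim_{N\to\infty}\mathrm{Prob}[Y > y] = 0.$$

Proof:

The Chernoff bound (Section 5.2.4, with δ in the role of ε) gives us

$$\mathrm{Prob}[Y > y] \le e^{-vy + N\gamma(v)} = e^{N[\gamma(v) - (\frac{1}{2}+\delta)v]}.$$

In order for $\lim_{N\to\infty}\mathrm{Prob}[Y > y] = 0$, we must show

$$\gamma(v) - \left(\tfrac{1}{2}+\delta\right)v < 0.$$

Since

$$v = \ln\frac{\tfrac{1}{2}+\delta}{\tfrac{1}{2}-\delta}$$

and since $0 < \delta < \tfrac{1}{2}$, we have v > 0. Let $x = e^{v}$, or $v = \ln x$, where x > 1. Since $\gamma(v) = \ln\frac{1+x}{2}$, the inequality we must show becomes

$$\ln\frac{1+x}{2} < \left(\tfrac{1}{2}+\delta\right)\ln x.$$

Taking the exponential of both sides, we get

$$\frac{1+x}{2} < x^{\frac{1}{2}+\delta}.$$

With $x = (1+2\delta)/(1-2\delta)$, the left side equals $1/(1-2\delta)$, and after taking logarithms the inequality becomes

$$f(\delta) = \left(\tfrac{1}{2}+\delta\right)\ln(1+2\delta) + \left(\tfrac{1}{2}-\delta\right)\ln(1-2\delta) > 0.$$

Since f(0) = 0 and $f'(\delta) = \ln\frac{1+2\delta}{1-2\delta} > 0$ for $0 < \delta < \tfrac{1}{2}$, this inequality holds. Hence, since $\mathrm{Prob}[Y > y] \le e^{-Nf(\delta)}$ with f(δ) > 0 independent of N, we have shown $\lim_{N\to\infty}\mathrm{Prob}[Y > y] = 0$ for any δ > 0.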


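With N a large number, we can then state that the upper bound on the average system time is

$$\lim_{N\to\infty}\mathrm{Prob}\left[S \le S_{UB}(y)\right] = 1, \qquad y = \left(\tfrac{1}{2}+\delta\right)N,$$

where δ is an arbitrarily small positive number. Since N/y → 2 as δ → 0, $S_{UB}(y) \to \frac{N}{2\mu}\left(1 + \frac{1}{2}\right)$. That is,

$$S \lesssim \frac{3N}{4\mu}.$$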


5.4.1.2 Lower Bound

For an N-task process graph, if we are given the number of levels y, then we know that the minimum average processing time is y/μ, where 1/μ is the average processing time of a task. With respect to all arrangements of the process graph with N tasks, this average processing time will be a lower bound with probability 1 − Prob[Y < y]. Since the number of arrangements with respect to the number of levels is Gaussian distributed with mean N/2, by symmetry we have

$$\mathrm{Prob}\left[Y < \frac{N}{2} - z\right] = \mathrm{Prob}\left[Y > \frac{N}{2} + z\right].$$

Hence, all the properties of the Chernoff bound discussed in the last section can be applied here also. Specifically, we can let N → ∞ for arbitrarily small δ > 0; then

$$\lim_{N\to\infty}\mathrm{Prob}\left[S \ge \left(\tfrac{1}{2}-\delta\right)\frac{N}{\mu}\right] = 1.$$

This is true since the tail probability approaches zero as N becomes very large. Thus,

$$S \gtrsim \frac{N}{2\mu}.$$


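5.4.1.3 Discussion

In the last two sections, we see that for N ≫ 1, the average system time for the case (k, G*, x*, P = ∞) is bounded by

$$\frac{N}{2\mu} \lesssim S \lesssim \frac{3N}{4\mu}$$

with high probability. Since a single processor needs N/μ on average, the concurrency measure lies between 1/2 and 3/4, and the speedup (its inverse) is bounded by

$$\frac{4}{3} \le \text{speedup} \le 2.$$

So, on the average, the best speedup we can achieve is two and the least speedup is 4/3.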
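The sketch below shows where individual random arrangements fall relative to these limits. It draws arrangements uniformly at random: each of the N − 1 gaps between consecutive tasks starts a new level with probability 1/2, matching Section 5.2.3. For each arrangement it computes two quantities, both normalized by the single-processor time N/μ. These are the level-count lower bound y/μ and the forced synchronization time $\sum_i H_{n_i}/\mu$. This simulates the per-arrangement bounds, not the actual system time. The choice μ = 1, the sample sizes, and all names are assumptions.

```python
import random
from math import fsum

def harmonic(n):
    return fsum(1.0 / k for k in range(1, n + 1))

def random_arrangement(n_tasks, rng):
    """Level widths of a uniformly random arrangement of n_tasks tasks."""
    widths, current = [], 1
    for _ in range(n_tasks - 1):
        if rng.random() < 0.5:  # this gap starts a new level
            widths.append(current)
            current = 1
        else:
            current += 1
    widths.append(current)
    return widths

def normalized_bounds(widths):
    """(y / N, sum_i H_{n_i} / N): the level-count lower bound and the FST,
    both divided by the single-processor time N / mu (mu cancels)."""
    n_tasks = sum(widths)
    return len(widths) / n_tasks, fsum(harmonic(n) for n in widths) / n_tasks

rng = random.Random(42)
samples = [normalized_bounds(random_arrangement(2000, rng)) for _ in range(200)]
low = fsum(s[0] for s in samples) / len(samples)
high = fsum(s[1] for s in samples) / len(samples)
print(f"concurrency between {low:.3f} and {high:.3f}; "
      f"speedup between {1 / high:.2f} and {1 / low:.2f}")
```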


5.4.2 Upper Bound with a Fixed Number of Precedence Relationships

We have studied random process graphs without considering the number of precedence relationships in the previous section. We have obtained some general properties of the arrangements of the tasks for process graphs and bounds on the average system time. The upper bound and the lower bound obtained are probabilistic: the larger the number of tasks, the more certain we are regarding these bounds. However, if we now include the number of precedence relationships, we can improve these bounds.

In this section, the number of precedence relationships is introduced into the model. We will obtain a tighter bound using the number of precedence relationships, the number of levels, and the number of tasks in a random process graph as parameters. We first develop the idea of minimally connected process graphs. The number of edges required for the minimally connected process graph is studied, as well as how additional edges can be added to it. Next, an algorithm is presented which gives a construction method for a process graph G* with N nodes, M edges and L levels. We will prove that the forced synchronization time (FST) obtained from process graph G* is indeed an upper bound on the average system time of all random process graphs with N nodes, M edges and L levels.
5.4.2.1 Minimally Connected Process Graph

Given any arrangement, the M edges can connect only a limited set of tasks. No edges, for example, are allowed to connect any tasks within the same level of a process graph. In Figure 5.3, we see the 26 legal places where the precedence relationships can be placed in a particular 3-level, 9-task graph: six positions between the first and the second level, eight between the second level and the third level, and twelve between the first level and the third level.

Figure 5.3 Legal Places for Precedence Relationships

If we allow the M edges to be randomly distributed among all these legal places, we often find that the process graph is not even connected. Even with a large number of edges, such as M > N log N, there is no guarantee that the resulting process graph is connected (although it is highly likely).

For purposes of calculating bounds on the average system time, the underlying arrangement must be connected such that each node is maintained at its proper level in the process graph. A node, j, is said to be in the i-th level if there exists a shortest path from the initial node to node j such that the number of nodes in this path equals i. We define $M_c$ to be the minimum number of edges required to fix all the nodes of a particular process graph in their proper levels. All edges are between nodes of adjacent levels instead of between nodes of non-adjacent levels because of our definition of level and because of the following lemma.


Lemma 5.7

The blocking time of an edge between two neighboring levels is greater than the blocking time of the same edge between two levels not neighboring each other.

Proof.

Suppose node i at level r is being blocked by node j, which is at level r − 2. If there exist two other edges (j, k) and (k, i) for some node k at level r − 1, then edge (j, i) presents no blocking to node i (see Figure 5.4) for the following reasons.


Figure 5.4 Reduced Blocking Effect (node j at level r − 2, node k at level r − 1, node i at level r)


As soon as node j is completed, node k starts execution. Since node k is still blocking node i, the release of the blocking from node j to node i does not allow node i to start execution. If no such indirect blocking edges exist, the blocking effect of the edge (j, i) is still partially reduced by the average task time of the other tasks in level r − 1 which do block node i. Hence, the blocking due to the edge (j, i) is not worse than that of any edge (k, i) where node k belongs to level r − 1.

Since we will be looking for an upper bound on the average system time with a limited number of precedence relationships, and since edges between adjacent levels result in greater blocking, we assume all edges are between two adjacent levels.


Figure 5.5 shows some examples of minimally connected process graphs. Each edge in Figure 5.5 is necessary in order to fix the nodes in their proper levels within the process graph. If any one edge is deleted, either the resulting process graph is disconnected or at least one node is no longer on a path from the initial node to the terminating node.

Figure 5.5 Minimally Connected Process Graph


From Lemma 5.7, we can calculate the exact value of $M_c$ for a process graph. Since we are looking for an upper bound on the average system time, we place all the edges between nodes in adjacent levels. Define $n_i$ to be the number of tasks in level i, for i = 1, 2, ..., L.

For any two adjacent levels, say levels i and i + 1, if $n_i \ge n_{i+1}$, then there must be at least $n_i$ edges between the i-th level and the (i + 1)-th level. Otherwise, at least one of the nodes, say node j, in the i-th level will have no edge leaving it, and the path from the initial node toward the terminating node stops at node j. This contradicts the definition of a task in a process graph. Therefore, there must be at least $n_i$ edges between levels i and i + 1.

If, on the other hand, $n_i < n_{i+1}$, then there must be at least $n_{i+1}$ edges between the i-th level and the (i + 1)-th level. If there are fewer than $n_{i+1}$ edges, then at least one of the nodes, say node j, in the (i + 1)-th level has no edges entering it, so it cannot be on any path from the initial node to the terminating node. This contradicts the definition of a task in a process graph; therefore, there must be at least $n_{i+1}$ edges between levels i and i + 1.

By selecting the larger of $n_i$ and $n_{i+1}$ to be the minimum number of edges between levels i and i + 1, we have enough edges to keep all nodes in the i-th and (i + 1)-th levels properly defined. Summing over all levels, we have

$$M_c = \max(n_1, n_2) + \max(n_2, n_3) + \cdots + \max(n_{L-1}, n_L).$$

The rules for making the minimally connected process graph are as follows. Let levels $i$ and $i+1$ be adjacent in the process graph.

1. $n_i < n_{i+1}$

For each node in level $i$ we assign an integer $1, 2, 3, \ldots, n_i$, and for each node in level $i+1$ we also assign an integer $1, 2, 3, \ldots, n_{i+1}$. Let $j$ represent a node in level $i+1$ and let $s$ equal the remainder of $j / n_i$ (a remainder of 0 denotes node $n_i$). Make a connection between node $s$ of level $i$ and node $j$ of level $i+1$.

2. $n_i = n_{i+1}$

For each node in level $i$, connect it to any node in level $i+1$ with indegree of zero.

3. $n_i > n_{i+1}$

For the nodes in levels $i$ and $i+1$, we assign the integers $1, 2, \ldots, n_i$ and $1, 2, 3, \ldots, n_{i+1}$, respectively. Let $j$ represent a node in level $i$ and let $s$ equal the remainder of $j / n_{i+1}$ (a remainder of 0 denotes node $n_{i+1}$). Make a connection between node $j$ of level $i$ and node $s$ of level $i+1$.


In all the cases above, each node in the $i$th level has at least one edge going to some node in the $(i+1)$th level, and each node in the $(i+1)$th level has at least one edge entering it from some node in the $i$th level. By extending this method to all levels, all nodes except the initial and terminating nodes have at least one edge entering and one edge leaving them. Hence each node is on a path between the initial and terminating nodes, and each node is held in its proper level in the process graph. Thus, we have a method for creating minimally connected process graphs.
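The three rules can be sketched in Python as follows. This is an illustrative sketch, not the dissertation's own program. Nodes in each level are numbered from 1, the helper names (`connect_adjacent`, `minimally_connect`, `check_minimal`) are hypothetical, a remainder of 0 is read as the last node of the smaller level, and rule 2's free choice of "any node with indegree zero" is resolved by matching node j to node j. The check confirms that every non-initial node has an incoming edge, every non-terminating node has an outgoing edge, and the edge count equals the sum of max(n_i, n_{i+1}).

```python
from typing import List, Tuple

# (level i, node in level i, node in level i+1); nodes are numbered from 1.
Edge = Tuple[int, int, int]


def connect_adjacent(i: int, n_i: int, n_next: int) -> List[Edge]:
    """Minimal edges between level i (n_i nodes) and level i+1 (n_next nodes)."""
    edges: List[Edge] = []
    if n_i < n_next:  # rule 1: one incoming edge for every node of level i+1
        for j in range(1, n_next + 1):
            edges.append((i, j % n_i or n_i, j))
    elif n_i == n_next:  # rule 2: match node j to node j (all indegrees start at 0)
        for j in range(1, n_i + 1):
            edges.append((i, j, j))
    else:  # rule 3: one outgoing edge for every node of level i
        for j in range(1, n_i + 1):
            edges.append((i, j, j % n_next or n_next))
    return edges


def minimally_connect(levels: List[int]) -> List[Edge]:
    """Apply the rules to every pair of adjacent levels; levels[0] is level 1."""
    edges: List[Edge] = []
    for i in range(len(levels) - 1):
        edges += connect_adjacent(i + 1, levels[i], levels[i + 1])
    return edges


def check_minimal(levels: List[int]) -> int:
    """Verify the in/out-edge property and return the number of edges used."""
    edges = minimally_connect(levels)
    has_out = {(i, u) for i, u, _ in edges}
    has_in = {(i + 1, v) for i, _, v in edges}
    last = len(levels)
    for level, n in enumerate(levels, start=1):
        for node in range(1, n + 1):
            assert level == last or (level, node) in has_out
            assert level == 1 or (level, node) in has_in
    assert len(edges) == sum(max(a, b) for a, b in zip(levels, levels[1:]))
    return len(edges)


print(check_minimal([1, 2, 3, 1]))     # 8  = 2 + 3 + 3
print(check_minimal([1, 3, 1, 3, 1]))  # 12 = 3 + 3 + 3 + 3
```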


After we place the $M_e$ edges into the arrangement of a process graph, there are still many pairs of nodes between adjacent levels where an edge can be placed. We call each of these pairs an empty edge slot (EES). All $M - M_e$ remaining edges are placed randomly into the EESs.

Two lemmas relating the value of $M_e$ to the other parameters of a process graph are:


Lemma  5.8 

For any process graph with N nodes and L levels, the maximum value of $M_e$ occurs in process graphs whose node arrangements are such that the levels with more than one node are separated by at least one level which has only one node. In these cases,

$$\max M_e = 2N - L - 1 \qquad (5.1)$$


Proof.

We are maximizing

$$M_e = \sum_{i=1}^{L-1} \max(n_i, n_{i+1})$$

with respect to the numbers $\{n_i\}$, subject to

$$\sum_{i=1}^{L} n_i = N.$$

Consider four adjacent levels $j-1$, $j$, $j+1$, and $j+2$, such that $n_{j-1}$, $n_j$, $n_{j+1}$, $n_{j+2}$ are, respectively, the numbers of nodes in these levels. We assume $n_j > n_{j+1}$; then there are four cases relating the values of $n_{j-1}$ and $n_j$ and the values of $n_{j+1}$ and $n_{j+2}$. (The argument for the case $n_j < n_{j+1}$ is similar to the following discussion.) We must show that the arrangement with $n_j' = n_j + n_{j+1} - 1$ and $n_{j+1}' = 1$, with the other levels having the same number of nodes, gives us a higher $M_e$.

Case I. $n_j \ge n_{j-1}$ and $n_{j+1} \ge n_{j+2}$

In calculating $M_e$ for these four levels, we have Z edges between these three pairs of levels, where

$$Z = n_j + n_j + n_{j+1}.$$

But if we change the arrangement to $n_{j-1}$, $n_j + n_{j+1} - 1$, $1$, $n_{j+2}$, then the number of edges becomes

$$Z' = (n_j + n_{j+1} - 1) + (n_j + n_{j+1} - 1) + n_{j+2}.$$

$Z'$ is at least as large as Z, since

$$Z' - Z = 2n_j + 2(n_{j+1} - 1) + n_{j+2} - (2n_j + n_{j+1}) = n_{j+1} - 2 + n_{j+2} \ge 0.$$

Thus, the arrangement $n_j' = n_j + n_{j+1} - 1$ and $n_{j+1}' = 1$, while fixing the values of the other $n_i$'s, will give a higher or equal value of $M_e$.

Case II. $n_{j-1} > n_j > n_{j+1} > n_{j+2} \ge 1$

For this case, we have

$$Z = n_{j-1} + n_j + n_{j+1},$$

and by shifting $(n_{j+1} - 1)$ nodes from the $(j+1)$th level to the $j$th level, we have

$$Z' = \begin{cases} n_{j-1} + (n_j + n_{j+1} - 1) + n_{j+2}, & n_{j-1} \ge n_j + n_{j+1} - 1 \\ 2(n_j + n_{j+1} - 1) + n_{j+2}, & n_{j-1} < n_j + n_{j+1} - 1 \end{cases}$$

and

$$Z' - Z = \begin{cases} n_{j+2} - 1 \ge 0, & n_{j-1} \ge n_j + n_{j+1} - 1 \\ n_j + n_{j+1} - 2 + n_{j+2} - n_{j-1} > 0, & n_{j-1} < n_j + n_{j+1} - 1. \end{cases}$$

Hence $Z' \ge Z$, i.e., the new arrangement gives a possibly higher value of $M_e$.
Case III. $n_{j-1} \le n_j$ and $n_{j+1} < n_{j+2}$

The argument for this case is similar to Case II above.


Case IV. $n_{j-1} > n_j > n_{j+1} \ge 1$ and $n_{j+1} < n_{j+2}$

In this case,

$$Z = n_{j-1} + n_j + n_{j+2},$$

and by shifting $(n_{j+1} - 1)$ nodes from the $(j+1)$th level to the $j$th level, we obtain

$$Z' = \begin{cases} n_{j-1} + n_j + n_{j+1} - 1 + n_{j+2}, & n_{j-1} \ge n_j + n_{j+1} - 1 \\ 2n_j + 2n_{j+1} - 2 + n_{j+2}, & n_{j-1} < n_j + n_{j+1} - 1. \end{cases}$$

Therefore,

$$Z' - Z = \begin{cases} n_{j+1} - 1 \ge 0, & n_{j-1} \ge n_j + n_{j+1} - 1 \\ n_j + 2n_{j+1} - 2 - n_{j-1} > 0, & n_{j-1} < n_j + n_{j+1} - 1, \end{cases}$$

so the new arrangement gives a higher (or equal) value of $M_e$.

So for all four cases, we can always obtain an equal or larger $\max M_e$ by shifting nodes from the $(j+1)$th level to the $j$th level, so that $n_j' = n_j + n_{j+1} - 1$ and $n_{j+1}' = 1$.

To see that $\max M_e$ occurs when the levels with more than one node are separated by at least one level which has only one node, we consider four adjacent levels again. Suppose the four adjacent levels $j-1$, $j$, $j+1$, and $j+2$ have $1$, $n_j$, $1$, $n_{j+2}$ nodes, respectively, where $n_j \ge 3$ (so that $n_j - 1 \ge 2$) and $n_{j+2} \ge 1$. In calculating $M_e$ for these four levels, we have a sum of

$$Z = n_j + n_j + n_{j+2}.$$

By shifting one node from the $j$th level to the $(j+1)$th level, this summation becomes

$$Z' = (n_j - 1) + (n_j - 1) + \max(2, n_{j+2}) = 2n_j - 2 + \max(2, n_{j+2}),$$

and

$$Z - Z' = 2 + n_{j+2} - \max(2, n_{j+2}) > 0.$$

The value of $M_e$ is smaller when we shift one (or more) nodes from the $j$th level to the $(j+1)$th level. Hence $\max M_e$ occurs when the levels with more than one node are separated by at least one level which has only one node.
The simplest arrangement of the N-node, L-level process graph with this property has $N - (L - 1)$ nodes in one level between the second and the $(L-1)$th level and one node in each of the remaining $(L - 1)$ levels. With this arrangement, we see that

$$\max M_e = (L - 3) + 2(N - L + 1)$$

or

$$\max M_e = 2N - L - 1.$$


Lemma  5.9 

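The minimum $M_e$ occurs when the nodes are distributed evenly among all levels. Let

$$y = \left\lfloor \frac{N-2}{L-2} \right\rfloor, \qquad x = \left\lceil \frac{N-2}{L-2} \right\rceil, \qquad z = \text{remainder of } \frac{N-2}{L-2}.$$

Then

$$\min M_e = (z+1)x + (L-z-2)y.$$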

Proof.

We are minimizing

$$M_e = \sum_{i=1}^{L-1} \max(n_i, n_{i+1})$$

with respect to the numbers $\{n_i\}$, subject to

$$\sum_{i=1}^{L} n_i = N.$$


Suppose the minimum $M_e$ occurred in an arrangement where four adjacent levels $j-1$, $j$, $j+1$, and $j+2$ have an equal number of nodes,

$$n_{j-1} = n_j = n_{j+1} = n_{j+2}.$$

Now we show that by moving a node from the $j$th level to the $(j+1)$th level, the value of $M_e$ for the resulting arrangement will be larger.

In calculating $M_e$ for these four levels, we have a sum of

$$Z = n_{j-1} + n_j + n_{j+1} = 3n_j.$$

By moving one node from the $j$th level to the $(j+1)$th level, we have a different sum of

$$Z' = n_{j-1} + (n_j + 1) + (n_j + 1) = 3n_j + 2 > Z.$$

The same is true in the cases

1. $n_{j-1} = n_j = n_{j+1} = x$ and $n_{j+2} = y$

2. $n_{j-1} = n_j = x$ and $n_{j+1} = n_{j+2} = y$

3. $n_{j-1} = x$ and $n_j = n_{j+1} = n_{j+2} = y$

Hence, we see that if we try to assign an "equal" number of nodes to each level for all levels, we obtain the minimum $M_e$. Thus, besides one node in the initial level and one node in the terminating level, each of the first $z$ successive levels will have $x$ nodes and each of the other $(L - z - 2)$ levels will have $y$ nodes. Finally, we have

$$\min M_e = (z+1)x + (L-z-2)y.$$

∎
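Lemmas 5.8 and 5.9 can be checked by brute force for small graphs. The sketch below is illustrative only, and its helper names are hypothetical. It enumerates every arrangement with one node in the first and last levels and at least one node in each of the L - 2 middle levels, computes $M_e$ as the sum of max(n_i, n_{i+1}), and compares the extremes with 2N - L - 1 and (z+1)x + (L-z-2)y. It assumes 3 <= L <= N.

```python
from itertools import combinations
from typing import Iterator, List, Tuple


def middle_arrangements(N: int, L: int) -> Iterator[List[int]]:
    """All ways to put N-2 nodes into the L-2 middle levels, at least one per level."""
    total, parts = N - 2, L - 2
    for cuts in combinations(range(1, total), parts - 1):
        bounds = (0,) + cuts + (total,)
        yield [bounds[t + 1] - bounds[t] for t in range(parts)]


def min_connection_edges(levels: List[int]) -> int:
    """M_e: edges needed to minimally connect this arrangement."""
    return sum(max(a, b) for a, b in zip(levels, levels[1:]))


def check_lemmas(N: int, L: int) -> Tuple[int, int]:
    values = [min_connection_edges([1] + mid + [1]) for mid in middle_arrangements(N, L)]
    y, z = divmod(N - 2, L - 2)   # floor and remainder of (N-2)/(L-2)
    x = y + 1 if z else y         # ceiling of (N-2)/(L-2)
    assert max(values) == 2 * N - L - 1                  # Lemma 5.8, eq. (5.1)
    assert min(values) == (z + 1) * x + (L - z - 2) * y  # Lemma 5.9
    return min(values), max(values)


for N in range(3, 13):
    for L in range(3, N + 1):
        check_lemmas(N, L)
print(check_lemmas(10, 6))  # (10, 13)
```

For N = 10 and L = 6 (the setting of Figure 5.6), $M_e$ therefore ranges from 10 to 13.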


5.4.2.2 An Upper Bound

In Section 5.4.1.1, an upper bound for the average system time was obtained from the arrangement with an equal number of nodes per level. This is also true in the case of process graphs with a fixed number of edges. We show this in the next theorem, after giving an algorithm (ALGORITHM A) which constructs an N-node process graph with M edges in L levels.

ALGORITHM A

STEP 1: Distribute one node for the initial task and one node for the terminating task.

STEP 2: Let

$$z = \text{remainder of } \frac{N-2}{L-2}, \qquad x = \left\lceil \frac{N-2}{L-2} \right\rceil, \qquad y = \left\lfloor \frac{N-2}{L-2} \right\rfloor.$$

In each of the levels $2, 3, \ldots, z+1$, place $x$ nodes, and in each of the levels $z+2, z+3, \ldots, L-1$, place $y$ nodes.

STEP 3: Use $M_e = (z+1)x + (L-z-2)y$ edges to minimally connect this arrangement of tasks.

STEP 4: Randomly distribute the remaining $M - M_e$ edges uniformly among the EESs.
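ALGORITHM A can be sketched in Python as below, under stated assumptions. Nodes are numbered from 1 within each level. STEP 3 uses the remainder rules given earlier for minimal connection, with rule 2 resolved as a node-by-node matching. STEP 4 draws the $M - M_e$ extra edges uniformly without replacement from all empty edge slots between adjacent levels. The function names are hypothetical, and the sketch assumes 3 <= L <= N and $M_e \le M \le \sum_i n_i n_{i+1}$.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[int, int, int]  # (level i, node in level i, node in level i+1)


def algorithm_a_levels(N: int, L: int) -> List[int]:
    """STEPS 1-2: single initial/terminating nodes, middle levels as even as possible."""
    y, z = divmod(N - 2, L - 2)   # y = floor((N-2)/(L-2)), z = remainder
    x = y + 1 if z else y         # x = ceil((N-2)/(L-2))
    return [1] + [x] * z + [y] * (L - 2 - z) + [1]


def algorithm_a(N: int, M: int, L: int, rng: random.Random) -> Tuple[List[int], Set[Edge]]:
    levels = algorithm_a_levels(N, L)
    # STEP 3: minimally connect adjacent levels (a remainder of 0 means the last node).
    edges: Set[Edge] = set()
    for i in range(L - 1):
        a, b = levels[i], levels[i + 1]
        for j in range(1, max(a, b) + 1):
            u = j if a >= b else (j % a or a)
            v = j if b >= a else (j % b or b)
            edges.add((i + 1, u, v))
    m_e = len(edges)
    # STEP 4: the empty edge slots are all remaining adjacent-level node pairs.
    slots = sorted(
        (i + 1, u, v)
        for i in range(L - 1)
        for u in range(1, levels[i] + 1)
        for v in range(1, levels[i + 1] + 1)
        if (i + 1, u, v) not in edges
    )
    if not 0 <= M - m_e <= len(slots):
        raise ValueError(f"M must lie between {m_e} and {m_e + len(slots)}")
    edges.update(rng.sample(slots, M - m_e))
    return levels, edges


levels, edges = algorithm_a(10, 12, 6, random.Random(0))
print(levels, len(edges))  # [1, 2, 2, 2, 2, 1] 12
```

With N = 10, L = 6 the even arrangement needs $M_e = 10$ edges, so M = 12 leaves two edges for STEP 4.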


Theorem  5.10 

If we construct a process graph with N nodes, M edges, and L levels according to ALGORITHM A, and assume each task has an exponential service time distribution with mean $1/\mu$, then an upper bound on the average system time is


where

$t_j$ = the maximum number of blocking edges into a node in level $j$,

$b_j = \max\{ b_q : q \text{ is a node in level } j \}$,

$b_q$ = indegree of node $q$ in the minimally connected process graph,

$EES_j$ = number of empty edge slots between levels $j-1$ and $j$,

$EES_T$ = total number of empty edge slots,

$n_{j-1}$ = number of nodes in level $j-1$.

Proof.

Case 1) $M \ge \max M_e$

In this case, there are enough edges that $b_j$ equals the number of nodes in the previous level, $n_{j-1}$, for all levels $j \ge 2$. Therefore $t_j = n_{j-1}$ for all $j \ge 2$. The proof that the arrangement generated by ALGORITHM A gives the largest FST is similar to the proof of Lemma 5.3 of the last section. Therefore, we only need to show that with $M \ge \max M_e$ we have enough edges for each level to force the maximum indegree of a node in that level to be equal to the number of nodes in the previous level (i.e., $b_j = n_{j-1}$). The number of extra edges is

$$M - M_e \ge \max M_e - \min M_e = (2N - L - 1) - \left[ (z+1)\left\lceil \frac{N-2}{L-2} \right\rceil + (L-z-2)\left\lfloor \frac{N-2}{L-2} \right\rfloor \right],$$

where $z$ is the remainder of $\frac{N-2}{L-2}$. Since for $z$ not equal to zero the value of $\min M_e$ becomes smaller, we shall let $z = 0$, so that

$$\max M_e - \min M_e = (2N - L - 1) - \frac{(L-1)(N-2)}{L-2}.$$

The inequality we wish to prove is then

$$\frac{\max M_e - \min M_e}{L-3} + 1 \ge \frac{N-2}{L-2}.$$

The left-hand side represents the number of additional edges remaining after the $\max M_e - \min M_e$ extra edges have been distributed equally among the $L-3$ levels, plus one edge representing the minimal connection of the node. The right-hand side represents the number of nodes per level. Multiplying both sides by $(L-3)$, we get

$$(2N - L - 1) - \frac{(L-1)(N-2)}{L-2} + (L-3) \ge \frac{(N-2)(L-3)}{L-2}.$$

Since $(2N - L - 1) + (L - 3) = 2(N-2)$, the left-hand side equals $(N-2)\left[2 - \frac{L-1}{L-2}\right] = \frac{(N-2)(L-3)}{L-2}$, so the inequality holds (with equality). In other words, we have enough edges to force the maximum blocking for each level. Hence the FST obtained from ALGORITHM A does give an upper bound for the average system time.


Case 2) $\min M_e < M < \max M_e$

We are maximizing $\sum_{q=1}^{L} y(d_q)$ subject to $\sum_{q=1}^{L} d_q <$ constant, where $d_q$ is the maximum indegree at level $q$. From the proof of Lemma 5.3, we find that the arrangement of $d_q$ that gives the maximum sum of $y(d_q)$ is

$$d_j = \text{constant}, \qquad j = 3, 4, \ldots, L-1,$$

with $d_1 = 0$ and $d_2 = 1$. This arrangement is exactly what ALGORITHM A generates. So for $\min M_e < M < \max M_e$, the $S_{UB}$ obtained from the FST of the arrangement generated by ALGORITHM A is also an upper bound.

∎


Figure 5.6 gives the minimally connected process graphs for some of the arrangements with N = 10 and L = 6. Table 5.1 shows simulation results and the predicted upper bound for the average system time with M = 12 for the process graphs with the arrangements shown in Figure 5.6. Since the arrangement of nodes in Case I of Figure 5.6 requires a minimum of 13 edges to be minimally connected and M < 13, no process graph can be created with 12 edges, and we cannot find its average system time.


Case    Simulated Average System Time    Calculated Upper Bound
I       not feasible                     378.7
II      364.24                           378.7
III     362.98                           378.7
IV      367.53                           378.7

Table 5.1


5.4.3 Comparison of the Two Upper Bounds

Two upper bounds have been obtained with different parameters. In Section 5.4.1.1, an upper bound was obtained through the probabilistic method on the likelihood of a process graph having a certain number of levels or less. In Section 5.4.2.2, another upper bound was obtained with a fixed number of nodes, edges, and levels. Since the latter method uses more information, its bound should be tighter than the bound obtained by the former method.

Let $S_{UB}$ represent the upper bound obtained in Section 5.4.1.1 and $S_{UBM}$ represent the upper bound obtained in Section 5.4.2.2.


Theorem 5.11

$S_{UBM} \le S_{UB}$ if $L \le \ldots$

Proof.

Case 1) $L < \ldots$

From the calculation of $S_{UB}$, we know that $S_{UB}$ is the largest FST for any process graph having $\ldots$ levels. Since the number of levels is smaller than $\ldots$, and we have proved in Lemma 5.4 that $S_{UB}$ is also the maximum FST for all numbers of levels less than $\ldots$, it follows that $S_{UBM} < S_{UB}$.

Case 2) $L = \ldots$

In this case, the arrangements used to obtain both bounds are the same. If the number of edges, M, is greater than or equal to $\max M_e$, $S_{UBM}$ is obtained with the maximum FST of this arrangement, or

$$S_{UBM} = S_{UB}.$$

If $M < \max M_e$, there will not be enough edges at every level for the maximum FST. Thus,

$$S_{UBM} < S_{UB}.$$

From the above two cases, we conclude that $S_{UBM} \le S_{UB}$ if $L \le \ldots$

When $L > \ldots$, it is likely that $S_{UBM} > S_{UB}$: the upper bound $S_{UBM}$ uses more information, but it is a worse bound. This implies that we did not use the information optimally in calculating $S_{UBM}$. The inequality $S_{UBM} > S_{UB}$ is not always true, because with different parameters N, M, and L it is possible to construct counterexamples. Also note that since $L \approx \ldots$ due to the law of large numbers, the case $L > \ldots$ is not likely to occur.


5.5 Trade-off between Average System Time and Utilization of Processors for a Diamond-shaped Process Graph (k, G*, x, P < ∞)

Given a process graph, if we have enough processors that the number of processors, P, assigned to one job is greater than $\max_i \{n_i\}$, then the average system time is just the time required to process L levels of nodes. The utilization of the P processors is, however, very low. On the other hand, if we try to utilize each processor more, the average system time will increase. In this section, we study this trade-off by using the notion of "power" defined in Chapter 4. The average task time is assumed to be constant, so we define

$$\text{Power} = \frac{\rho}{S/R},$$

where $\rho$ is the efficiency of the P processors, S is the average system time, and R is the constant task service time (i.e., $S/R$ is the normalized average system time).


We assume that the shape of the process graphs is bounded by a diamond-shaped parallelogram (as in Figure 5.7) which can be characterized by two parameters, L and m, where L is the number of levels in the process graph and m is the slope of the diamond enclosing the boundary tasks in this process graph. If the number of levels, L, is even, then we assume the $(L/2)$th level and the $(L/2 + 1)$th level have the same number of tasks.


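Let $n(l)$ denote the number of tasks on the $l$th level, and let $n(1) = 1$. Then we know

$$n(l+1) - n(l) = \frac{2}{m}, \qquad 1 \le l < \frac{L}{2} \ (L \text{ even}), \qquad 1 \le l < \frac{L+1}{2} \ (L \text{ odd}),$$

$$n(l) - n(l+1) = \frac{2}{m}, \qquad \frac{L}{2} < l < L \ (L \text{ even}), \qquad \frac{L+1}{2} \le l < L \ (L \text{ odd}).$$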

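Hence, for an even number of levels L,

$$n(l) = \begin{cases} 1 + \dfrac{2}{m}(l-1), & 1 \le l \le \dfrac{L}{2} \\ 1 + \dfrac{2}{m}(L-l), & \dfrac{L}{2} + 1 \le l \le L \end{cases}$$

and for an odd number of levels L,

$$n(l) = \begin{cases} 1 + \dfrac{2}{m}(l-1), & 1 \le l \le \dfrac{L+1}{2} \\ 1 + \dfrac{2}{m}(L-l), & \dfrac{L+1}{2} < l \le L. \end{cases}$$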


For example, let L = 9 and m = 2; then

$$n(l) = \begin{cases} 1 + \dfrac{2}{2}(l-1) = l, & 1 \le l \le \dfrac{L+1}{2} \\ 10 - l, & \dfrac{L+1}{2} < l \le L. \end{cases}$$


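Therefore, given m and L, the total number of tasks N in a process graph is

$$N = \begin{cases} 2\displaystyle\sum_{l=1}^{L/2} \left[1 + \frac{2}{m}(l-1)\right] = L\left(1 - \frac{1}{m}\right) + \frac{L^2}{2m}, & L \text{ even} \\ 2\displaystyle\sum_{l=1}^{(L-1)/2} \left[1 + \frac{2}{m}(l-1)\right] + 1 + \frac{L-1}{m} = L\left(1 - \frac{1}{m}\right) + \frac{L^2 + 1}{2m}, & L \text{ odd.} \end{cases}$$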
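The level widths and the closed forms for N can be checked with exact rational arithmetic. This is a small sketch that assumes, as above, $n(1) = 1$ and a width change of 2/m per level. The function names are hypothetical.

```python
from fractions import Fraction
from typing import List


def diamond_widths(L: int, m: int) -> List[Fraction]:
    """n(l), l = 1..L: grows by 2/m per level toward the middle, then shrinks."""
    step = Fraction(2, m)
    return [1 + step * (min(l, L + 1 - l) - 1) for l in range(1, L + 1)]


def total_tasks(L: int, m: int) -> Fraction:
    """Closed form: L(1 - 1/m) + L^2/(2m), plus 1/(2m) when L is odd."""
    extra = 1 if L % 2 else 0
    return L * (1 - Fraction(1, m)) + Fraction(L * L + extra, 2 * m)


print([int(w) for w in diamond_widths(9, 2)])  # [1, 2, 3, 4, 5, 4, 3, 2, 1]
print(total_tasks(9, 2))                       # 25
```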

Define

$$B = \text{the width of the process graph at the widest level} = \begin{cases} 1 + \dfrac{L-2}{m}, & L \text{ even} \\ 1 + \dfrac{L-1}{m}, & L \text{ odd} \end{cases}$$

and

$\rho_1$ = efficiency averaged over all the time when not all P processors are busy.


Note that when P = 1, then H = 1 and we expect $\rho = 1$; this can be verified from the above expressions. Furthermore, if P = B, we have, for an even number of levels L,

$$\rho = \frac{2m + L - 2}{2m + 2L - 4},$$

and for L odd, we have

$$\rho = \frac{2m + L - 2 + \frac{1}{L}}{2(m + L - 1)}.$$


For both cases of $\rho$, we see that

$$\lim_{L \to \infty} \rho = \frac{1}{2}.$$

This can be observed from the fact that the area occupied by the diamond is exactly half the area of a rectangle L levels long and B processors wide. The reason that $\rho$ is not exactly one half for small L is that a full task might not be able to exist on the boundary of the diamond.
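The P = B formulas can be reproduced from $\rho = N/(B \cdot L)$. The assumption here is that with B processors every level finishes in one normalized task time, so the job occupies L time units on B processors while doing N units of work. The sketch below uses that reading with hypothetical names. It checks the even and odd closed forms exactly and shows the drift toward one half.

```python
from fractions import Fraction
from typing import List


def widths(L: int, m: int) -> List[Fraction]:
    """n(l) for the diamond: +2/m per level up to the middle, then -2/m."""
    step = Fraction(2, m)
    return [1 + step * (min(l, L + 1 - l) - 1) for l in range(1, L + 1)]


def efficiency_at_full_width(L: int, m: int) -> Fraction:
    """rho = work / (processors x time) with P = B processors and L unit-time levels."""
    w = widths(L, m)
    return sum(w) / (max(w) * L)


def efficiency_closed_form(L: int, m: int) -> Fraction:
    if L % 2 == 0:
        return Fraction(2 * m + L - 2, 2 * m + 2 * L - 4)
    return (2 * m + L - 2 + Fraction(1, L)) / (2 * (m + L - 1))


for L in range(2, 60):
    for m in range(1, 6):
        assert efficiency_at_full_width(L, m) == efficiency_closed_form(L, m)
print(float(efficiency_at_full_width(1001, 3)))  # slightly above 0.5
```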


As an approximation, we assume each level has a continuum of tasks. We now find the number of processors that maximizes power. Let $i$ be the index of the levels in G and let $n_i$ be the number of tasks in level $i$; then

$$B = \max_i n_i \approx \frac{L}{m}.$$

We find

$$\rho = \frac{\text{sum of the fractions of the } P \text{ processors kept busy during the processing time}}{\text{total time required to process this job}},$$

and, eliminating P,

$$S = \frac{L}{2\sqrt{\rho(1-\rho)}}.$$

Interestingly, S is not a function of m when the optimal number of processors is used: for a specific utilization $\rho$, we obtain the same S regardless of the value of the slope m.

As is known [KLEI79], the maximum power is achieved at the point where the S versus $\rho$ curve intersects a straight line from the origin approaching from the right (see Figure 5.8).

From this method we can find the $\rho^*$ at the intersection and substitute it back into the expression for P to obtain the optimal number of processors.


The S versus $\rho$ curve does not start from $\rho = 0$, because the minimum $\rho$ occurs when P = B:

$$\min \rho = \frac{L^2/(2mB)}{\dfrac{mB}{2} + \dfrac{L^2}{2mB}}.$$

In addition, because no job can be finished in less than L normalized units of time, we have $S \ge L$.

Figures 5.9-5.13 show some examples of the average system time versus utilization curves for several values of L. Table 5.2 compares the optimal value of P* solved from the equation with the value calculated by using the $\rho^*$ obtained from Figures 5.9-5.13.


L     ρ*      P* (calculated from ρ*)    P* (exact solution with m = 1)
10    0.82    -                          6.284
20    -       11.98                      12.047
30    0.76    -                          17.824
40    0.76    -                          23.594
-     -       -                          29.369

Table 5.2

The P* obtained from Figures 5.9-5.13 should theoretically be the same as the P* calculated from the equation, and they are very close to each other. More interesting is the fact that P* is approximately equal to $0.6\,L/m$. In other words, we should provide a number of processors for each job equal to six tenths of the number of levels L divided by the slope m of the process graph.


Substituting the value of P* into the expression for the average system time,

$$S = \frac{mP^*}{2} + \frac{L^2}{2mP^*},$$

we have an approximate average system time

$$S^* \approx \frac{0.6L}{2} + \frac{L^2}{2(0.6)L} \approx 1.133L.$$


Note that if we do not care about the low utilization of processors, then by using the maximum number of processors the average system time is S = L. The slightly larger S* is the trade-off between the utilization of processors and the average system time of jobs. Furthermore, we note that the approximate concurrency measure is

$$\sigma = \frac{S(P^*)}{S(1)} \approx \frac{1.133L}{L^2/(2m)} = \frac{2.266\,m}{L}.$$
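Under the continuum approximation the power trade-off can be explored numerically. The sketch assumes R = 1, $S(P) = mP/2 + L^2/(2mP)$ for $0 < P \le B = L/m$ (the expression used above), $N = L^2/(2m)$ for the area of the continuous diamond, and $\rho = N/(P S)$. The grid search and the function names are illustrative only; this is not the method used to produce Figures 5.9-5.13. In this model power peaks at $P^* = L/(\sqrt{3}\,m) \approx 0.577\,L/m$ with $\rho^* = 3/4$, and $S(0.6L/m) = 17L/15 \approx 1.133L$, in line with the rounded values in the text.

```python
from math import sqrt


def system_time(P: float, L: float, m: float) -> float:
    """Normalized average system time, valid for 0 < P <= L/m."""
    return m * P / 2 + L * L / (2 * m * P)


def utilization(P: float, L: float, m: float) -> float:
    work = L * L / (2 * m)  # area of the continuous diamond (number of tasks)
    return work / (P * system_time(P, L, m))


def power(P: float, L: float, m: float) -> float:
    return utilization(P, L, m) / system_time(P, L, m)  # R = 1


def best_processor_count(L: float, m: float, steps: int = 100_000) -> float:
    """Grid search for the P in (0, L/m] that maximizes power."""
    B = L / m
    grid = (B * k / steps for k in range(1, steps + 1))
    return max(grid, key=lambda P: power(P, L, m))


L, m = 30.0, 1.0
P_star = best_processor_count(L, m)
print(P_star, L / (sqrt(3) * m))                            # both about 17.32
print(utilization(P_star, L, m))                            # about 0.75
print(system_time(0.6 * L / m, L, m) / L)                   # about 1.133
print(system_time(0.6 * L / m, L, m) / (L * L / (2 * m)))   # sigma, about 2.266 m / L
```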


5.6 Bounds on the Average System Time with a Limited Number of Processors

(k, G*, x, P < ∞)

In this section, we limit the number of processors to a constant P. With a finite number of processors, we have the problem of scheduling tasks. When more than one task demands a processor and only one processor is available, we are forced to pick one task to be processed. The method of selecting which task to process next in general affects the average system time of the job.


Assuming unit task processing times, a lower bound on the average system time can easily be calculated as

$$S_{LB} = \max\{\ldots\},$$

where L is the number of levels, N is the number of tasks per process graph, k is the number of jobs, and P is the number of processors.


A very loose upper bound can be obtained by using the Longest Expected Processing Time First assignment algorithm [COFF76]. Assume P < k, and that we assign only one processor to a job. Whenever a job is completed, the processor looks for another unstarted job to process. Idle processors are not allowed to assist other jobs. Since only one processor works on a single job, the structure of the random process graph does not influence the execution time: each job takes the same amount of processing time, $Nx$ units.

Let $t = \lfloor k/P \rfloor$ and $l$ = remainder of $k/P$; then an upper bound is

$$S_{UB} = \frac{1}{k}\left[ NxP + 2NxP + \cdots + tNxP + (t+1)Nx\,l \right].$$
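The deterministic bound can be checked against a direct simulation of the one-processor-per-job policy. This sketch assumes every job takes exactly N·x time units and that a free processor always takes the next unstarted job. The names are hypothetical.

```python
from typing import List


def upper_bound(k: int, P: int, N: int, x: float = 1.0) -> float:
    """(1/k)[NxP + 2NxP + ... + tNxP + (t+1)Nx*l] with t = floor(k/P), l = k mod P."""
    t, l = divmod(k, P)
    total = sum(i * N * x * P for i in range(1, t + 1)) + (t + 1) * N * x * l
    return total / k


def simulate(k: int, P: int, N: int, x: float = 1.0) -> float:
    """Average completion time when each processor runs whole jobs back to back."""
    free_at: List[float] = [0.0] * P
    finish: List[float] = []
    for _ in range(k):
        p = min(range(P), key=lambda q: free_at[q])  # earliest free processor
        free_at[p] += N * x
        finish.append(free_at[p])
    return sum(finish) / k


print(upper_bound(7, 3, 10), simulate(7, 3, 10))  # both 120/7, about 17.14
```

Because every job has the same length, the jobs complete in synchronized batches of P, which is exactly the sum in the bound.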


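If we allow random task times with distribution $F(t)$ and mean $\bar{X}$, and if we require synchronization of P jobs (i.e., all P jobs must finish before another P jobs are started), then we have the Longest Expected Remaining Processing Time First assignment algorithm [COFF76], which gives

$$S_{UB} = \frac{1}{k}\Big[ M_1 P + (M_1 + M_2)P + \cdots + (M_1 + M_2 + \cdots + M_t)P + (M_1 + M_2 + \cdots + M_t + M_{t+1})\,l \Big],$$

where

$$M_a = \max_{i \in \text{batch } a} \sum_{j=1}^{N} X_{ij}, \qquad a = 1, 2, \ldots, t+1,$$

all $M_a$ have the same distribution, and

$X_{ij}$ = random task time for task $j$ of job $i$.

From the Law of Large Numbers, for every $\epsilon > 0$,

$$\lim_{N \to \infty} \Pr\left\{ \left| \frac{1}{N}\sum_{j=1}^{N} X_{ij} - \bar{X} \right| < \epsilon \right\} = 1.$$

Hence,

$$\lim_{N \to \infty} \Pr\left\{ \left| \frac{S_{UB}}{N} - \frac{\bar{X}}{k}\left( P + 2P + \cdots + tP + (t+1)l \right) \right| < \epsilon \right\} = 1,$$

or

$$\lim_{N \to \infty} \Pr\left\{ \left| \frac{S_{UB}}{N} - \frac{\bar{X}}{k}\left( \frac{t(t+1)}{2}P + (t+1)l \right) \right| < \epsilon \right\} = 1.$$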

The bounds obtained in this section are very loose. The minimum average system time is known to be achieved by the Shortest Expected Remaining Processing Time First (SERPT) scheduling algorithm [COFF76]. But with random process graphs, we do not know the exact structure of each process graph, which is needed in order to apply SERPT.


5.7 Discussion

In this chapter, we have attempted to observe some properties of the arrangements of random process graphs. We found a method to construct all the arrangements of the tasks for a process graph with N tasks. The distribution of the number of arrangements with respect to the number of levels was shown to be Gaussian. The tail probability of this distribution was bounded by the Chernoff bound. Next, an upper bound and a lower bound for the average system time were obtained for a specific number of levels in a process graph. As N becomes large, the probability that the average system time of a randomly chosen process graph lies between the upper bound and the lower bound calculated near the mean number of levels approaches one.

The number of precedence relationships and the number of levels were added to the model next. The bounds for this case were found and compared to the previous bounds.

We used the notion of power to study the trade-off between the utilization of the processors and the average system time. A very loose upper bound was presented for the case where the number of processors is finite.


CHAPTER 6

Process-Communication Graphs


In the previous two chapters, we have assumed that sending data between processors is free and that the communication between them can be achieved instantaneously. In reality, there is always some delay incurred when communicating between processors.

Gentleman [GENT78] found that in a multiprocessor environment, even though data paths are provided to move data between processors, data from one processor is only immediately available to a small number of other processors, and in general, moving data from one processor to another requires several submoves. Gentleman uses matrix multiplication on the ILLIAC IV, where processors are connected in a two-dimensional rectangular grid, as an example in which $\ldots$ data movements are required. Hence, we cannot ignore the communication cost in general.

To minimize the communication cost, we will try to assign as many tasks as possible to a processor. Of course, there will be no communication cost if all the tasks are assigned to a single processor. On the other hand, we are looking for maximum concurrency, which tends to use as many processors as possible to execute the tasks. A compromise must then be made to balance these two opposing objectives. Consequently, we can no longer assume that there are an infinite number of processors. With a limited number of processors, the need for task assignment comes back.


In this chapter, we still use the process graph discussed in Chapter 3, except that we add communication tasks that represent the communications required between the processors. We will look at how the number of processors affects the average system time and how we can obtain the average system time by converting the process-communication graph into a Markov chain (similar to Algorithm CPM in Chapter 4, but with some differences due to the limited number of processors). We do not address the task assignment problem specifically. Instead, a simple rule of thumb is used to decide on which processor a task should reside and where the communication tasks are added in the process graph.


6.1 Communication Tasks (k, G, x*, P < ∞)

We have assumed in Chapters 3, 4 and 5 that the tasks in a process graph may be assigned to any processors and that there is no cost for passing data between tasks, whether due to contention on the communication bus or to the physical distance between any pair of processors. In this section, we explore a way to represent this communication cost in the model discussed in Chapter 4.

We assume that the process graph is fixed and that the task service time is exponentially distributed with a mean of $1/\mu$ sec. If each task resides on a different processor, then there is a communication delay between any two tasks that have a precedence relationship. We will treat this communication cost as another task whose service time is exponentially distributed (with a mean of $1/\mu_c$ sec.). For example, Figure 6.1b is the process graph obtained when we add communication tasks to the process graph of Figure 6.1a.


Figure 6.1a Process Graph


A communication task $c_{ij}$ represents the communication between the processor where task $i$ resides and the processor where task $j$ resides. We will call the process graph with the communication tasks added the "process-communication graph".

Of course, if a task $i$ has several subtasks $t_1, t_2, \ldots, t_a$, where $a \ge 2$, we can assign one of the subtasks to the same processor on which task $i$ was executing. We assume for now that the subtask selected to reside on the same processor as task $i$ is picked at random, and that when two tasks reside on the same processor, the communication cost between these two tasks is zero.
Figure 6.1b Process-Communication Graph


In the example shown in Figure 6.1b, we can let tasks 1 and 2 reside on one processor and let tasks 3 and 4 reside on another processor. The resulting process-communication graph is shown in Figure 6.2.

Since the communication task is treated just like a regular task, regular tasks and a communication task may execute concurrently. If we assume multiple communication buses, then more than one communication task could also execute in parallel. In Section 6.3, we will examine the case when only one communication task is allowed to execute at any given time.

The exact communication time requirement depends upon the access protocols, the communication bandwidth, the volume of data to be transmitted and the physical locations of the processors requiring the communication. Lee [LEE77], discussed in Section 2.3.6, approximated the communication cost by multiplying the volume of data to be transmitted by the distance between the two processors. His method assumed that the volume of data transmitted between two processors is fixed and that the distance between two processors is either a constant multiplied by the number of hops or a linear function of the physical distance.
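Lee's approximation reduces to a one-line cost model. The sketch below only illustrates the two distance choices described above; the function names and the parameters `cost_per_hop`, `a` and `b` are hypothetical and are not taken from [LEE77].

```python
def hop_distance(hops: int, cost_per_hop: float) -> float:
    """Distance modeled as a constant multiplied by the number of hops."""
    return cost_per_hop * hops


def linear_distance(physical_distance: float, a: float, b: float) -> float:
    """Distance modeled as a linear function a + b * d of the physical distance."""
    return a + b * physical_distance


def communication_cost(volume: float, distance: float) -> float:
    """Lee-style approximation: data volume multiplied by inter-processor distance."""
    return volume * distance


# Hypothetical example: 10 data units sent over 3 hops at 2.0 cost units per hop.
print(communication_cost(10, hop_distance(3, 2.0)))  # 60.0
```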

In this chapter, we let $\mu_c = s\mu$, where $s$ is a real number greater than zero. If $s > 1$, a communication task takes less time on average than a regular task. In particular, as $s \to \infty$, the communication cost goes to zero.
Figure 6.2 Process-communication Graph


Using Algorithm CPM (Section 4.2.2), we can convert a process-communication graph into a Markov chain to solve for the average system time. Figure 6.3 is the Markov chain for the process-communication graph of Figure 6.2. Recall that $C_\alpha$ is a state where all the tasks in $\alpha$ are executing concurrently.
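As a rough cross-check of the conversion, the sketch below computes the mean time to complete one job directly by first-step analysis over sets of completed tasks (the concurrently executing set $\alpha$ is determined by which tasks have completed), instead of through equilibrium probabilities as Algorithm CPM does. It assumes unlimited processors and busses and exponential service times with $\mu = 1$. The reading of Figure 6.2 is an assumption, since the figure is not reproduced here: Figure 6.1a is taken to be the diamond 1 → {2, 3} → 4, and the cross-processor edges (1, 3) and (2, 4) become communication tasks `c13` and `c24`. Under that reading the result agrees with the closed form $(3 + 9s + 7s^2)/(2s + 2s^2)$ in units of $1/\mu$. All function names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


def mean_completion_time(preds, rates):
    """Mean time to finish one job with exponential task times (sketch).

    preds: dict task -> set of predecessor tasks (process-communication graph).
    rates: dict task -> service rate (mu for regular tasks, s * mu for communication).
    Unlimited processors and busses: every enabled task executes at once.
    """
    tasks = frozenset(rates)

    def executing(done):
        # The set alpha of state C_alpha: unfinished tasks whose predecessors are done.
        return [t for t in tasks - done if preds[t] <= done]

    @lru_cache(maxsize=None)
    def remaining(done):
        if done == tasks:
            return Fraction(0)
        alpha = executing(done)
        total = sum(rates[t] for t in alpha)
        # Mean holding time 1/total; task t finishes first with prob. rates[t]/total.
        return (1 + sum(rates[t] * remaining(done | {t}) for t in alpha)) / total

    return remaining(frozenset())


def figure_6_2(s):
    # Assumed reading of Figure 6.2; regular rate mu = 1, communication rate s.
    preds = {"1": set(), "2": {"1"}, "c13": {"1"}, "3": {"c13"},
             "c24": {"2"}, "4": {"3", "c24"}}
    rates = {"1": 1, "2": 1, "3": 1, "4": 1, "c13": s, "c24": s}
    return preds, rates


s = Fraction(2)
print(mean_completion_time(*figure_6_2(s)))          # 49/12
print((7 * s**2 + 9 * s + 3) / (2 * s**2 + 2 * s))   # 49/12
```

Exact `Fraction` arithmetic is used so the Markov-chain result can be compared term for term with the rational closed form.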

We wish to solve for the value of $s$ such that the resulting average system time equals $S(P = 1)$. We denote such an $s$ as $s_e$. Note that with $s = s_e$, the concurrency measure $\sigma$ equals one. This will allow us to separate out those systems which yield a net improvement when parallel processing is introduced.

Using the balance equations and the fact that the equilibrium state probabilities sum to one, we obtain the expression for $s_e$ for the example shown in Figure 6.3:

$$S(s_e) = \frac{3 + 9s_e + 7s_e^2}{2s_e + 2s_e^2}\cdot\frac{1}{\mu} = S(P = 1) = \frac{4}{\mu}$$

Solving for $s_e$ (the equation reduces to $s_e^2 - s_e - 3 = 0$), we get $s_e = (1 + \sqrt{13})/2 \approx 2.3027$. In other words, for the job represented by Figure 6.1a, if $\mu_c > 2.3027\mu$, multiprocessing is still faster than a single processor (even with the communication delay). But if $\mu_c < 2.3027\mu$, it is better to process this job on a single processor.
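The root can also be found numerically, which is how one would proceed for larger chains where no quadratic falls out. This is a minimal sketch assuming $\mu = 1$, so the target value is $S(P = 1) = N = 4$; the function names are hypothetical. Bisection is valid here because $S(s)$ is strictly decreasing in $s$ (the numerator of its derivative is $-(4s^2 + 12s + 6) < 0$).

```python
import math


def mean_system_time(s):
    """mu * S(s) for the Figure 6.3 chain, i.e. S in units of 1/mu."""
    return (3 + 9 * s + 7 * s * s) / (2 * s + 2 * s * s)


def solve_s_e(target=4.0, lo=1e-6, hi=100.0, tol=1e-12):
    """Bisection for mean_system_time(s) == target on the bracket [lo, hi]."""
    assert mean_system_time(lo) > target > mean_system_time(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_system_time(mid) > target:
            lo = mid  # still slower than one processor: need cheaper communication
        else:
            hi = mid
    return (lo + hi) / 2


print(f"{solve_s_e():.6f}")                   # 2.302776
print(f"{(1 + math.sqrt(13)) / 2:.6f}")       # 2.302776
```

The exact root is 2.302776 to six places, which the text reports truncated as 2.3027.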


6.2 Limited Number of Processors Per Job (k, G, x*, P < ∞)

If there is a limited number of processors and the number of processors, $P$, assigned to a job $G$ is smaller than the width of $G$, then tasks must be scheduled onto processors. Whenever more than one task is waiting for an available processor, we must consider the communication costs when deciding which task should be processed on that processor. If the available processor does not have the data necessary for executing a task, the data must be transmitted from another processor. Thus, we have a difficult optimization problem of assigning tasks to the processors. In this section, we discuss some simple rules of thumb for assigning tasks. These rules most likely will not produce the best assignment in terms of minimizing the average system time, but they provide a basis for analyzing the effect of having different numbers of processors for a specific process graph.

The  rules  of  thumb  for  assigning  tasks  to  processors  are 

1. For a task $i$, if there is only one task $j$ such that there is a precedence relationship $(i, j)$, then assign task $j$ to the same processor where task $i$ is processed.

2. For a task $i$, if there is a set of tasks $\alpha$ such that there is a precedence relationship $(i, j)$ for each $j \in \alpha$, and there is only one available processor, assign all the tasks in $\alpha$ to the same processor where task $i$ is processed.

3. For a task $i$, if there is a set of tasks $\alpha$ such that there is a precedence relationship $(i, j)$ for each $j \in \alpha$, and there is more than one processor available, suppose there are $P'$ available processors and $x$ tasks in the set $\alpha$; then

i. for $x \le P'$:

assign one task $j \in \alpha$ to the processor where task $i$ is processed, create one communication task for each of the remaining $x - 1$ tasks, and assign each of them to a separate processor.

ii. for $x > P'$:

assign $(x - P' + 1)$ tasks to the processor where task $i$ is processed, create $(P' - 1)$ communication tasks, and assign each of the remaining $P' - 1$ tasks to one of the other $(P' - 1)$ processors.

4. For a task $i$, if it has only one edge $(i, j)$ leaving it and task $j$ has been assigned to another processor $p$, then create a communication task to transfer the data to processor $p$.
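Rules 1 to 3 can be collapsed into one placement routine, since rule 1 is the case $x = 1$ and rule 2 is the case $P' = 1$ of rule 3(ii). The sketch below assumes that $P'$ counts the processor of task $i$ itself, which is what makes the counts in rules 2 and 3 add up to $x$. Where the text picks the locally kept subtask at random, this sketch simply takes the first ones in list order. The function name, the task names and the tuple format of a communication task are hypothetical.

```python
def assign_successors(successors, parent_proc, other_free_procs):
    """Place the successors of a task i following rules 1-3 (sketch).

    successors: list of tasks j with a precedence relationship (i, j).
    parent_proc: processor on which task i ran.
    other_free_procs: the other available processors, so P' = 1 + len(...).
    Returns (placement, comm_tasks); each communication task is recorded as
    (source processor, destination processor, task it feeds).
    """
    p_avail = 1 + len(other_free_procs)
    x = len(successors)
    # Rule 1 (x = 1) and rule 3(i) (x <= P') keep one successor local;
    # rule 2 (P' = 1) and rule 3(ii) (x > P') keep x - P' + 1 of them local.
    n_local = 1 if x <= p_avail else x - p_avail + 1
    local, remote = successors[:n_local], successors[n_local:]
    placement = {j: parent_proc for j in local}
    comm_tasks = []
    # Every remote successor gets its own processor and one communication task.
    for j, proc in zip(remote, other_free_procs):
        placement[j] = proc
        comm_tasks.append((parent_proc, proc, j))
    return placement, comm_tasks


placement, comm = assign_successors(["a", "b", "c", "d"], 0, [1, 2])
print(placement)  # {'a': 0, 'b': 0, 'c': 1, 'd': 2}
print(comm)       # [(0, 1, 'c'), (0, 2, 'd')]
```

In the example $x = 4 > P' = 3$, so rule 3(ii) keeps $4 - 3 + 1 = 2$ tasks local and ships the other two over two communication tasks.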

Figures 6.4-6.6 show a simple example of the task assignment.


Figure 6.6 Process-communication Graph with P = 3


Figure 6.4 is the original process graph. Figure 6.5 shows the assignment for the $P = 2$ case, and Figure 6.6 shows the assignment for the $P = 3$ case. In Figure 6.5, tasks 1, 2 and 4 reside on one processor while tasks 3 and 5 reside on the other processor. In Figure 6.6, tasks 1 and 2 reside on the first processor; tasks 3 and 5 reside on the second processor; task 4 resides on the third processor. If we draw circles around the task assignments in the original process graph of Figure 6.4, any precedence relationship that connects tasks across these circles indicates that a communication task is required.
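The "draw circles" test amounts to checking, for every edge, whether its endpoints share a processor, and splicing a communication task into each edge that crosses. The sketch below applies it to an assumed diamond reading of Figure 6.1a (1 → {2, 3} → 4) with the Figure 6.2 assignment, because the edges of Figure 6.4 are not reproduced here. The function name, the processor labels and the `c{i}_{j}` naming scheme are hypothetical.

```python
def add_communication_tasks(edges, placement):
    """Build process-communication graph edges from a process graph and an assignment."""
    new_edges, comm_tasks = [], []
    for i, j in edges:
        if placement[i] == placement[j]:
            new_edges.append((i, j))  # same processor: zero communication cost
        else:
            c = f"c{i}_{j}"           # hypothetical name for the communication task
            comm_tasks.append(c)
            new_edges += [(i, c), (c, j)]  # edge i -> j becomes i -> c -> j
    return new_edges, comm_tasks


# Assumed diamond reading of Figure 6.1a with tasks 1, 2 and 3, 4 on two processors.
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
placement = {1: "p1", 2: "p1", 3: "p2", 4: "p2"}
new_edges, comm = add_communication_tasks(edges, placement)
print(comm)  # ['c1_3', 'c2_4']
```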

When the communication cost is assumed to be zero ($s = \infty$), we expect the plot of $\mu S$ versus $P$ to look like Figure 6.7. When $P = 1$, $\mu S = N$, where $N$ is the number of tasks in the process graph, and as $P$ increases, the value of $\mu S$ approaches $N^*$, where $N^*/\mu$ is the average system time $S(P = \infty)$ obtained in Chapter 4.


Figure 6.7 μS versus P


In general, as the cost of communication increases (smaller $s$), the average system time also increases. Figure 6.8 is a typical example of a family of curves of $\mu S$ versus $P$. For a specific value of $P$, say $P'$, there will be a specific value $s = s_e$ such that the value of $\mu S$ at these values of $s$ and $P'$ equals $N$. In other words, the advantage of multiprocessing on $P'$ processors is erased by the communication cost between processors (with mean time $1/(s_e\mu)$). As $s$ becomes smaller than a specific $s^*$, the communication cost becomes too large for any multiprocessing at all, and the normalized average system time is greater than $N$.

Figures 6.9-6.15 show an example of the process graph (Figure 6.9), the process-communication graphs for $P = 2, 3, 4, 5, 6$ (Figures 6.10-6.14), and the resulting family of $\mu S$ curves versus $P$ (Figure 6.15). Of course, these process-communication graphs are for a specific assignment of tasks to processors, but the behavior of the curves is likely to be similar for other assignments.

For some of the curves, a horizontal line intersects the $\mu S$ curve at two points, as in Figure 6.16. This implies that the average system time is the same whether we use $P_1$ processors or $P_2$ processors. We are therefore interested in the efficiency of the processors. Since $P_2 > P_1$, and in both cases the same amount of work is completed in the same average system time, the $P_2$ processors must be idle more often (waiting for the communication tasks to complete).


Let us denote the average system time when $P = 1$ by $S_1$, and the average system time with $P$ processors and $\mu_c = s\mu$ by $S_P(s)$. The total work to be done is constant and equals $N/\mu$ seconds, which is also $S_1$. Let $\eta$ denote the efficiency of the processors, defined as

$$\eta = \frac{S_1}{P\,S_P(s)},$$

which is the total useful work done divided by $P$ times the time the $P$ processors need to complete this amount of useful work. If we write $S_1/S_P(s) = C(s)$, treating $C(s)$ as a constant for a specific $s$, then

$$\eta = \frac{C(s)}{P}.$$

Therefore, the efficiency of the processors decreases as the number of processors increases.
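The efficiency definition can be evaluated on the small two-processor example of Figure 6.2, reusing the closed form for $S$ given with Figure 6.3. This sketch assumes $\mu = 1$ and $N = 4$, so $S_1 = 4$; the helper names are hypothetical and the example is only an illustration, not one of the cases plotted in Figures 6.8-6.16.

```python
from fractions import Fraction


def efficiency(s1, sp, p):
    """eta = S_1 / (P * S_P(s)): useful work divided by P times the elapsed time."""
    return s1 / (p * sp)


def s2_example(s):
    """Hypothetical helper: S_2(s) for the two-processor graph of Figure 6.2 (mu = 1)."""
    return (3 + 9 * s + 7 * s * s) / (2 * s + 2 * s * s)


S1 = Fraction(4)  # N / mu with N = 4 tasks and mu = 1
for s in (Fraction(1), Fraction(10), Fraction(1000)):
    # C(s) = S1 / S_2(s), so eta = C(s) / 2 here.
    print(s, efficiency(S1, s2_example(s), 2))
```

For $s = 1$ this gives $\eta = 8/19$; as $s$ grows, $S_2(s)$ falls toward $7/2$ and $\eta$ rises toward $4/7$. At $s = s_e$, $S_2 = S_1$ and $\eta = 1/P = 1/2$: the second processor adds no speedup.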


6.3 One Communication Bus (k, G, x*, P < ∞)

In the last two sections, we have assumed that there is more than one communication bus, so that whenever a processor needs to send data to another processor, the data are transmitted without having to wait for a communication channel. In this section, we assume that there is only one communication bus, so that only one processor may be transmitting at any given time. We also assume that communication tasks are perfectly scheduled: if more than one communication task needs to be transmitted, each of them is transmitted in turn, and each knows which one is to be transmitted next (no collisions).

To obtain the average system time, we again convert the process-communication graph into a Markov chain. The states of the Markov chain are represented as $C_\alpha^\theta$, where the set $\alpha$ contains the tasks being executed in parallel, and the set $\theta$ contains all the communication tasks waiting to be transmitted. If the set $\theta$ is not empty, then one of the tasks in $\alpha$ must be a communication task. After the communication task in $\alpha$ finishes, we can activate one of the tasks in $\theta$. The conversion algorithm is the same as Algorithm CPM of Section 4.2.1, with the modification that each time a task $i$ completes execution, we add the following two steps:

1. If, as a result of the completion of task $i$, a set of communication tasks becomes active, concatenate them onto the set $\theta$.

2. If task $i$ was a communication task and the set $\theta$ is not empty, bring one of the tasks in $\theta$ into the set $\alpha$.
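The two extra steps can be sketched by enlarging the state to (completed tasks, FCFS queue $\theta$) and again computing the mean job time by first-step analysis. This is an illustration under stated assumptions, not the thesis's conversion code: $\mu = 1$, entry tasks are regular tasks, ties among simultaneously enabled communication tasks are queued in name order, and a newly enabled communication task starts immediately if the bus is idle (which the invariant "θ non-empty implies a communication task in α" requires). Figure 6.17 is not reproduced, so the example reuses an assumed diamond reading of Figure 6.2; all names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


def mean_time_one_bus(preds, rates, comm):
    """Mean job completion time when communication tasks share one bus (sketch).

    State = (done, theta): completed tasks and the FCFS tuple theta of
    communication tasks waiting for the bus. The executing set alpha is every
    enabled task not waiting in theta, so it holds at most one communication task.
    """
    tasks = frozenset(rates)

    def enabled(done):
        return {t for t in tasks - done if preds[t] <= done}

    def alpha(done, theta):
        return enabled(done) - set(theta)

    def complete(done, theta, t):
        new_done = done | {t}
        # Step 1: communication tasks enabled by t's completion join theta.
        newly = sorted((enabled(new_done) - enabled(done)) & comm)
        theta = theta + tuple(newly)
        # Step 2: if the bus is now idle, the head of theta starts transmitting.
        if theta and not (alpha(new_done, theta) & comm):
            theta = theta[1:]
        return new_done, theta

    @lru_cache(maxsize=None)
    def remaining(done, theta):
        if done == tasks:
            return Fraction(0)
        running = alpha(done, theta)
        total = sum(rates[t] for t in running)
        return (1 + sum(rates[t] * remaining(*complete(done, theta, t))
                        for t in running)) / total

    return remaining(frozenset(), ())


def figure_6_2_graph(s):
    # Assumed reading of Figure 6.2: 1 -> {2, c13}; c13 -> 3; 2 -> c24; {3, c24} -> 4.
    preds = {"1": set(), "2": {"1"}, "c13": {"1"}, "3": {"c13"},
             "c24": {"2"}, "4": {"3", "c24"}}
    rates = {"1": 1, "2": 1, "3": 1, "4": 1, "c13": s, "c24": s}
    return preds, rates, frozenset({"c13", "c24"})


print(mean_time_one_bus(*figure_6_2_graph(Fraction(1))))  # 39/8
```

With multiple busses the same graph gives $19/4 = 38/8$ at $s = 1$; the extra $1/8$ comes from runs in which task 2 finishes while `c13` still holds the bus, so `c24` must wait in $\theta$.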
After the Markov chain is constructed, we can solve for the equilibrium state probabilities and the average system time using the same method studied in Chapter 4. We expect the average system time with one communication bus to be greater than with multiple communication busses, because of the delay caused by the serial (non-parallel) processing of the communication tasks.

For example, Figure 6.18 is one of the Markov chains that can be obtained from the process-communication graph of Figure 6.17, where tasks 3, 4, 5 and 8 are the communication tasks. When task 2 completes in the state shown in Figure 6.18, communication task 5 is concatenated to the set $\theta$; communication task 4 cannot be moved to the set $\alpha$ because task 3 is a communication task and there is only one communication bus available. When communication task 3 completes, one of the communication tasks in the set $\theta$ can be activated, and task 6 can also start execution (since it has received the information from communication task 3).


Figure 6.17 A Process-communication Graph


6.4 Discussion


In this chapter, we added the communication overhead to the process graphs. This overhead is represented as new nodes in a process graph. The resulting graph is called the process-communication graph, and it may be converted into a Markov chain to obtain the average system time.

When the number of processors is limited to $P$, we presented some rules of thumb on how tasks should be assigned to processors and where communication tasks must be added. This assignment was by no means the optimal assignment; it was used so that we could analyze the process-communication graph and study a few of its properties. For example, we found how the $\mu S$ curve behaves when plotted against $P$ and how it changes as the parameter $s$ increases. We also found the efficiency of the $P$ processors working on a job.

Finally, we studied the communication problem when only one communication bus is allowed. The difference between this case and the two previous cases is that, when converting a process-communication graph into a Markov chain, we cannot let more than one communication task execute at the same time. The additional communication tasks were kept in a first-come, first-served queue.